Fix the potential issue of duplicate targets #3742
Conversation
@swiatekm Hello, I've submitted this pull request. Could you please review it at your convenience? Thank you!
This approach seems wasteful, as you're applying the relabeling step twice. It also circumvents current component responsibility boundaries, as we'd like all relabeling to happen in the prehook. Instead, I think it'd be much simpler to defer calculating the target url. We only need the url during target allocation, at which point relabeling is guaranteed to be done. What I'd do:
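The deferred-URL idea can be sketched in Go. This is a toy model with hypothetical types (`Item`, map-based labels), not the target allocator's actual code: the URL is derived from the labels on demand, so it always reflects whatever relabeling produced.

```go
package main

import "fmt"

// Item is a hypothetical stand-in for the target allocator's target type.
// Instead of computing the URL eagerly at discovery time, it derives the
// URL from its labels on demand, after relabeling has already run.
type Item struct {
	Labels map[string]string
}

// URL is computed lazily from the current labels, so it always reflects
// the post-relabeling __scheme__, __address__, and __metrics_path__.
func (i *Item) URL() string {
	scheme := i.Labels["__scheme__"]
	if scheme == "" {
		scheme = "http"
	}
	path := i.Labels["__metrics_path__"]
	if path == "" {
		path = "/metrics"
	}
	return fmt.Sprintf("%s://%s%s", scheme, i.Labels["__address__"], path)
}

func main() {
	item := &Item{Labels: map[string]string{"__address__": "10.0.0.1:9090"}}
	// A relabeling step rewrites the address after discovery; the lazily
	// computed URL picks up the change with no extra bookkeeping.
	item.Labels["__address__"] = "10.0.0.2:9090"
	fmt.Println(item.URL()) // http://10.0.0.2:9090/metrics
}
```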
The general processing flow of Targets is as follows:
My considerations are as follows:
@swiatekm However, the above PR does circumvent the component responsibility boundaries, and placing the relabel operation in the prehook is indeed more reasonable. Since the deduplication operation is very similar to the target-discarding operation, I think the deduplication can be placed in the default filter.
Alright, I see what the problem is. What we need to defer here is not just determining the target url, but also the hash. Is that right @mike9421? If that's the case, could you hold off on this change for a bit? I have a WIP change that accomplishes the same thing, albeit for different reasons. It is much simpler than what you have here, so I'll submit it and we can see if it addresses your issue. Does that sound ok?
Alright, this PR will be put on hold until your WIP is completed. Which PR corresponds to the WIP? I'd like to see if it resolves the issue.
#3777 is the PR. It doesn't resolve the issue, but it will make it much simpler to do so. I think just calculating the Hash and TargetUrl forcefully in the relabel prehook should solve it. |
Yes, a forced relabel operation is required. As I mentioned above, since the prehook is not mandatory (in the current design), the following issue may need to be considered:
Yes, we might need to force it, or at least document that target duplication is possible when it's disabled. Right now it's enabled by default, so it wouldn't be too drastic of a change at least. More broadly, we do want to look into doing all target relabeling in the target allocator (right now we just drop targets, but don't change the labels themselves). This is more complicated, but it would make us more compatible with Prometheus and resolve these duplication issues. |
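The difference between the current drop-only behavior and full relabeling can be sketched like this. It is a toy model with simplified map-based labels, not the operator's actual relabeling code: the filter-only path keeps or drops a target but throws away any label rewrites, while the full path honors both.

```go
package main

import "fmt"

// relabelRule is a hypothetical simplification of a Prometheus relabel
// config: it may rewrite labels and decides whether to keep the target.
type relabelRule func(labels map[string]string) (map[string]string, bool)

// filterOnly mirrors the current target allocator behavior: it uses the
// rule's keep/drop decision but discards the rewritten labels.
func filterOnly(rule relabelRule, labels map[string]string) (map[string]string, bool) {
	_, keep := rule(labels)
	return labels, keep
}

// fullRelabel mirrors Prometheus itself: both the rewritten labels and
// the keep/drop decision are honored.
func fullRelabel(rule relabelRule, labels map[string]string) (map[string]string, bool) {
	return rule(labels)
}

func main() {
	// A rule that rewrites the scrape address and keeps the target.
	rule := func(labels map[string]string) (map[string]string, bool) {
		out := map[string]string{}
		for k, v := range labels {
			out[k] = v
		}
		out["__address__"] = "proxy:9090"
		return out, true
	}
	orig := map[string]string{"__address__": "10.0.0.1:9090"}

	filtered, _ := filterOnly(rule, orig)
	rewritten, _ := fullRelabel(rule, orig)
	fmt.Println(filtered["__address__"])  // 10.0.0.1:9090 (rewrite lost)
	fmt.Println(rewritten["__address__"]) // proxy:9090
}
```

Doing full relabeling in the allocator, as the comment suggests, would close the gap with Prometheus and make address-rewrite-induced duplicates detectable before allocation.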
The updated PR enables passing the required hash value and enhances extensibility, allowing different deduplication strategies through the use of different hash calculation methods for different |
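The pluggable hash idea can be illustrated with a small Go sketch. Names here (`HashFunc`, `LabelSetHash`, `AddressOnlyHash`) are hypothetical, not the PR's actual API; the point is that swapping the hash function changes the deduplication strategy.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// HashFunc lets different allocation strategies choose how a target's
// identity hash is computed.
type HashFunc func(labels map[string]string) uint64

// LabelSetHash hashes the full label set in sorted key order, so targets
// are only duplicates if every label matches.
func LabelSetHash(labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0xff}) // separator to avoid key/value ambiguity
		h.Write([]byte(labels[k]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

// AddressOnlyHash deduplicates purely on the scrape address.
func AddressOnlyHash(labels map[string]string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(labels["__address__"]))
	return h.Sum64()
}

// dedupe keeps the first target for each distinct hash value.
func dedupe(targets []map[string]string, hash HashFunc) []map[string]string {
	seen := map[uint64]bool{}
	var out []map[string]string
	for _, t := range targets {
		if k := hash(t); !seen[k] {
			seen[k] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	targets := []map[string]string{
		{"__address__": "10.0.0.1:9090", "job": "a"},
		{"__address__": "10.0.0.1:9090", "job": "b"},
	}
	fmt.Println(len(dedupe(targets, LabelSetHash)))    // 2: label sets differ
	fmt.Println(len(dedupe(targets, AddressOnlyHash))) // 1: same address
}
```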
I don't love needing to set a function like this; I propose another idea that may be worth testing.
…argets (open-telemetry#3617)" This reverts commit e03a841.
…fault; dropping it is not a recommended practice 2. The Prometheus receiver in OpenTelemetry relies on 'job', and dropping it causes errors
@swiatekm @jaronoff97 Hi, could you please help review this PR when you have time and check if any updates are needed? This PR:
It’s a bit challenging to fully decouple these unit tests from Prometheus.
@swiatekm @jaronoff97
I think this should work. Do we have any existing end to end tests that we can update or view to verify this is working as expected? Thank you for your patience 🙇
Yes, the PR includes unit tests for the changes. I'll be fixing the linting issues and also adding some end-to-end tests to verify this over the next couple of days. Thank you!
Hi @jaronoff97 @swiatekm , could you please help review this PR when you have time? Here's a summary of the changes:
…l to prevent failures caused by premature detection results
- Fixed target.GetNodeName() for per-node strategy
- Simplified duplicate target detection logic
- Updated unit tests to verify new behavior
@jaronoff97 @swiatekm
This change looks broadly correct to me. My main issue with it is that we'd be changing how relabeling works in the target allocator for all users, without the ability for them to change it back in the event that the change has unforeseen consequences or bugs. For me to approve this, it needs to be configurable, either via a feature flag or configuration option.
In the interest of moving things along, how about we do the following:
- You extract the hash calculation changes into a separate PR. These are a straightforward bug fix, so we can merge them quickly, and removing them from this PR will make it easier to review.
- In this PR, we remove the e2e tests, and make the new behavior contingent on a new config option which will remain hidden and only used in unit tests.
- In a separate PR, we expose the new config option in the TargetAllocator CRD and add the e2e tests using it.
How does that sound? I know I'm asking for additional work, but I really don't want to make this change haphazardly.
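The hidden-config-option approach described above might look roughly like the following sketch. The option name `DedupeRelabeledTargets` and the surrounding types are hypothetical, invented for illustration; the point is that legacy behavior remains the default and the new path is opt-in.

```go
package main

import "fmt"

// Config is a hypothetical allocator config with a hidden option gating
// the new deduplication behavior, so users can fall back to the legacy
// behavior if the change has unforeseen consequences.
type Config struct {
	DedupeRelabeledTargets bool // hidden at first; exercised only by unit tests
}

// filterTargets applies deduplication only when the option is enabled.
func filterTargets(cfg Config, targets []string) []string {
	if !cfg.DedupeRelabeledTargets {
		return targets // legacy behavior: duplicates pass through
	}
	seen := map[string]bool{}
	var out []string
	for _, t := range targets {
		if !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	targets := []string{"10.0.0.1:9090", "10.0.0.1:9090"}
	fmt.Println(len(filterTargets(Config{}, targets)))                             // 2: flag off
	fmt.Println(len(filterTargets(Config{DedupeRelabeledTargets: true}, targets))) // 1: flag on
}
```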
```yaml
# Ensure metrics are flushed to the exporter at least once
batch:
  send_batch_size: 1000
  timeout: 20s
```
If we want to guarantee a flush, isn't it better to set a very low timeout? Or just not use the batch processor at all.
In this test, a single collector needs to scrape metrics from multiple prometheus targets, and different targets have varying scrape completion times. The verification script needs to wait until all target metrics are collected before performing consistency validation, so we use the batch processor to ensure data completeness. Setting a very low timeout could result in the verification only getting partial target data.
P.S. My previous comment was indeed not clear enough 😅
I think it'd be simpler to just have the verification script poll periodically and check if all the data is available. You can't really guarantee completeness with what you're doing now, and the test will likely be flaky on slow GHA runners.
```yaml
metadata:
  name: metrics-consistency-script
data:
  main.py: |
```
I wish this was achievable using only chainsaw's Prometheus metric parsing function. Or, if we do need the script, I'd prefer it live in a .py file in the tests directory and be imported here somehow. Python code as text inside a ConfigMap definition is highly unmaintainable.
You're right. I will split it into separate files and look for simpler approaches than the current validation script.
```go
// These labels are typically required for correct scraping behavior and are expected to be retained after relabeling:
// - job
// - __scrape_interval__
// - __scrape_timeout__
// - __scheme__
// - __metrics_path__
// Prometheus adds these labels by default. Removing them via relabel_configs is considered invalid and is therefore ignored.
// For details, see:
// https://github.com/prometheus/prometheus/blob/e6cfa720fbe6280153fab13090a483dbd40bece3/scrape/target.go#L429
```
Is the intent of this comment to explain what `relabel.Process` does? It doesn't have anything to do with the code in this package.
Thanks for the feedback. You're right that this comment is confusing in its current placement.
The intent of this comment was to document a behavioral difference between Prometheus and the target allocator regarding relabeling of these specific labels. In Prometheus, attempting to remove these labels via relabel_configs would cause an error, while the target allocator allows it and can still scrape normally.
However, since this behavioral difference doesn't actually impact functionality (the target allocator works fine either way), and the comment is indeed misleading about what the local code does, I'll remove it in a follow-up change.
The key point was just that these labels shouldn't be dropped during relabeling to maintain consistency with Prometheus behavior, but since it doesn't break anything when they are dropped, the comment adds more confusion than value.
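One way to keep the allocator consistent with Prometheus here would be a guard that restores required labels if relabeling dropped them. This is a hypothetical helper, not code from the PR, and `restoreRequired` is an invented name:

```go
package main

import "fmt"

// requiredLabels are the labels the comment above says should survive
// relabeling to stay consistent with Prometheus behavior.
var requiredLabels = []string{
	"job", "__scrape_interval__", "__scrape_timeout__", "__scheme__", "__metrics_path__",
}

// restoreRequired is a hypothetical helper: if relabeling dropped a
// required label, put the pre-relabeling value back, mirroring how
// Prometheus treats dropping these labels as invalid.
func restoreRequired(before, after map[string]string) map[string]string {
	for _, name := range requiredLabels {
		if _, ok := after[name]; !ok {
			if v, ok := before[name]; ok {
				after[name] = v
			}
		}
	}
	return after
}

func main() {
	before := map[string]string{"job": "app", "__scheme__": "http"}
	after := map[string]string{"__scheme__": "http"} // relabeling dropped "job"
	fixed := restoreRequired(before, after)
	fmt.Println(fixed["job"]) // app
}
```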
// Copyright The OpenTelemetry Authors |
Why can't this code live in relabel_test.go?
Good point! You're right - since these functions are only used by relabel_test.go, they should live there instead of in a separate utility file. I'll move them over in the next update.
Description:
There may be a problem of unexpected target duplication -- for example, users rewrite the address during relabeling.
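The issue can be reproduced with a minimal sketch, assuming simplified map-based labels rather than the allocator's real types: two distinct discovered targets are relabeled to the same address, so any identity derived from the post-relabeling address collapses them into one.

```go
package main

import "fmt"

// uniqueAddresses counts distinct post-relabeling scrape addresses.
func uniqueAddresses(targets []map[string]string) int {
	seen := map[string]bool{}
	for _, t := range targets {
		seen[t["__address__"]] = true
	}
	return len(seen)
}

func main() {
	// Two distinct discovered targets.
	discovered := []map[string]string{
		{"__address__": "pod-a:8080"},
		{"__address__": "pod-b:8080"},
	}
	// A relabeling step rewrites both addresses to a shared proxy, making
	// the targets indistinguishable by address.
	for _, t := range discovered {
		t["__address__"] = "proxy:8080"
	}
	fmt.Println(len(discovered), uniqueAddresses(discovered)) // 2 1
}
```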
Link to tracking Issue(s):
Testing:
Add the `TestDiscovery_NewItem` unit test function and adjust the relabel_test.go file to accommodate the new test.