
Conversation

mike9421

Description:

There may be a problem of unexpected target duplication -- for example, when users rewrite the target address (__address__) during relabeling.

Link to tracking Issue(s):

Testing:
Added a TestDiscovery_NewItem unit test and adjusted relabel_test.go to accommodate the new test.

@mike9421 mike9421 requested a review from a team as a code owner February 23, 2025 16:51
@mike9421
Author

@swiatekm Hello, I've submitted this pull request. Could you please review it at your convenience? Thank you!

@swiatekm
Contributor

This approach seems wasteful, as you're applying the relabeling step twice. It also circumvents current component responsibility boundaries, as we'd like all relabeling to happen in the prehook.

Instead, I think it'd be much simpler to defer calculating the target url. We only need the url during target allocation, at which point relabeling is guaranteed to be done. What I'd do:

  1. Make the TargetURL attribute private and start empty.
  2. Add a GetTargetURL method that returns the attribute value if set, otherwise checks the label and sets it.
  3. Since this attribute is serialized into JSON by the web server, you might need to explicitly set a tag on it.
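
A minimal sketch of the lazy accessor being suggested, assuming a simplified Item type (the names below are illustrative, not the allocator's actual code):

```go
package target

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/labels"
)

// Item is a simplified stand-in for the allocator's target item; field and
// method names here are illustrative only.
type Item struct {
	Labels    labels.Labels
	targetURL string // private; computed lazily, after relabeling is done
}

// GetTargetURL returns the cached URL if already set, otherwise derives it
// from the (relabeled) __address__ label and caches it.
func (t *Item) GetTargetURL() string {
	if t.targetURL == "" {
		t.targetURL = t.Labels.Get(model.AddressLabel)
	}
	return t.targetURL
}
```

Note that encoding/json does not serialize unexported fields, so keeping the URL in the web server's JSON output would need an exported mirror field or a custom marshaller rather than only a struct tag.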

@mike9421
Author

mike9421 commented Mar 2, 2025

The general processing flow of "Targets" is as follows:
Discovery -> NewItem -> SetTargets -> prehook -> assign

My considerations are as follows:

  • If deduplication is not performed before assigning targets, duplicate targets may still be scraped. The consistent-hashing strategy can ensure that duplicates are assigned to the same OTel instance by hashing the TargetURL after relabeling, but the least-weighted strategy cannot guarantee assignment to the same OTel instance (see the sketch after this list).
  • The prehook is not currently required to be enabled, so in this PR all relabel operations have been moved into NewItem, and the RelabeledKeep field is saved for relabelConfigTargetFilter.
  • After deduplication, the TargetURL used by consistent-hashing could have remained unchanged. However, I later found that the targets returned by Prometheus are unordered (they are collected by iterating over a map in allGroups), which means the target selected after deduplication may vary each time. Therefore, I added the TargetURLRelabeled field for consistent-hashing to hash on. This ensures that even if a different duplicate is selected, it will still be assigned to the same OTel instance.
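
A rough, self-contained sketch of the difference described in the first bullet (illustrative only; this is not the allocator's actual assignment code):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// pickByHash chooses a collector purely from the relabeled address, so two
// duplicates of the same target always land on the same collector.
func pickByHash(relabeledAddr string, collectors []string) string {
	h := fnv.New64a()
	h.Write([]byte(relabeledAddr))
	return collectors[h.Sum64()%uint64(len(collectors))]
}

// pickLeastWeighted chooses whichever collector currently has the fewest
// targets; two duplicates seen at different times can land on different ones.
func pickLeastWeighted(load map[string]int) string {
	best, bestLoad := "", math.MaxInt
	for c, n := range load {
		if n < bestLoad {
			best, bestLoad = c, n
		}
	}
	load[best]++
	return best
}

func main() {
	collectors := []string{"otel-0", "otel-1"}
	// Two discovered entries that relabel to the same address.
	fmt.Println(pickByHash("10.0.0.5:9100", collectors)) // always the same collector
	fmt.Println(pickByHash("10.0.0.5:9100", collectors))

	load := map[string]int{"otel-0": 0, "otel-1": 0}
	fmt.Println(pickLeastWeighted(load)) // first copy goes to the emptier collector
	fmt.Println(pickLeastWeighted(load)) // the duplicate may then go to the other one
}
```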

This approach seems wasteful, as you're applying the relabeling step twice. It also circumvents current component responsibility boundaries, as we'd like all relabeling to happen in the prehook.

Instead, I think it'd be much simpler to defer calculating the target url. We only need the url during target allocation, at which point relabeling is guaranteed to be done. What I'd do:

  1. Make the TargetURL attribute private and start empty.
  2. Add a GetTargetURL method that returns the attribute value if set, otherwise checks the label and sets it.
  3. Since this attribute is serialized into JSON by the web server, you might need to explicitly set a tag on it.

@swiatekm Modifying the TargetURL alone cannot solve the potential issue of the least-weighted strategy scraping duplicate targets. Additionally, since the prometheusreceiver will perform relabeling again, the TargetURL provided to OTel must be the original TargetURL (i.e., the original __address__).

However, the above PR does circumvent the component responsibility boundaries, and placing the relabel operation in the prehook is indeed more reasonable. Since the deduplication operation is very similar to the target discarding operation, I think the deduplication operation can be placed in the default filter (i.e., relabelConfigTargetFilter) first. Afterward, we may need to consider whether multiple prehooks are necessary or if there should be a default prehook. I'd love to hear your thoughts on this.

@swiatekm
Contributor

swiatekm commented Mar 3, 2025

Alright, I see what the problem is. What we need to defer here is not just determining the target URL, but also the hash. Is that right @mike9421?

If that's the case, could you hold off on this change for a bit? I have a WIP change that accomplishes the same thing, albeit for different reasons. It is much simpler than what you have here, so I'll submit it and we can see if it addresses your issue. Does that sound ok?

@mike9421
Author

mike9421 commented Mar 6, 2025

Alright, I see what the problem is. What we need to defer here is not just determining the target URL, but also the hash. Is that right @mike9421?

If that's the case, could you hold off on this change for a bit? I have a WIP change that accomplishes the same thing, albeit for different reasons. It is much simpler than what you have here, so I'll submit it and we can see if it addresses your issue. Does that sound ok?

Alright, this PR will be put on hold until your WIP is completed. Which PR corresponds to the WIP? I'd like to see if it resolves the issue.

@swiatekm
Contributor

swiatekm commented Mar 6, 2025

#3777 is the PR. It doesn't resolve the issue, but it will make it much simpler to do so. I think just calculating the Hash and TargetUrl forcefully in the relabel prehook should solve it.
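
A hedged sketch of that idea; the Item type and field names are simplified stand-ins rather than the allocator's real API, while relabel.Process is the upstream Prometheus helper:

```go
package prehook

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/model/relabel"
)

// Item is a simplified stand-in for the allocator's target item.
type Item struct {
	Labels    labels.Labels
	targetURL string
	hash      uint64
}

// applyRelabelPrehook relabels every target and forcefully recomputes the
// derived fields, so deduplication and allocation only ever see values
// computed from the post-relabel label set.
func applyRelabelPrehook(items []*Item, cfgs []*relabel.Config) []*Item {
	kept := items[:0]
	for _, it := range items {
		lbls, keep := relabel.Process(it.Labels, cfgs...)
		if !keep {
			continue // target dropped by relabeling
		}
		it.Labels = lbls
		it.targetURL = lbls.Get(model.AddressLabel) // re-derive the scrape URL
		it.hash = lbls.Hash()                       // re-hash over relabeled labels
		kept = append(kept, it)
	}
	return kept
}
```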

@mike9421
Author

mike9421 commented Mar 8, 2025

I think just calculating the Hash and TargetUrl forcefully in the relabel prehook should solve it.

Yes, a forced relabel operation is required. As I mentioned above, since the prehook is not mandatory (in the current design), the following issue may need to be considered:

Afterward, we may need to consider whether multiple prehooks are necessary or if there should be a default prehook.

@swiatekm
Contributor

swiatekm commented Mar 8, 2025

Yes, a forced relabel operation is required. As I mentioned above, since the prehook is not mandatory (in the current design), the following issue may need to be considered:

Yes, we might need to force it, or at least document that target duplication is possible when it's disabled. Right now it's enabled by default, so it wouldn't be too drastic of a change at least.

More broadly, we do want to look into doing all target relabeling in the target allocator (right now we just drop targets, but don't change the labels themselves). This is more complicated, but it would make us more compatible with Prometheus and resolve these duplication issues.

@mike9421 mike9421 force-pushed the bugfix/potential_duplication branch from 758630b to e03a841 on April 14, 2025 at 17:03
@mike9421
Author

The updated PR enables passing the required hash value and enhances extensibility, allowing different deduplication strategies by using different hash calculation methods for each FilterStrategy. For now, I haven’t found a better approach for passing the relabeled data to func (t *Item) Hash() ItemHash.
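
For illustration, the kind of pluggable hash being described might look roughly like this (a sketch only; the hashFunc field is an assumption for illustration, not the PR's actual mechanism):

```go
package target

import "github.com/prometheus/prometheus/model/labels"

type ItemHash uint64

// Item carries a strategy-specific hash function: a relabel-aware filter can
// inject one that hashes the relabeled label set, while other strategies keep
// the default hash over the original labels.
type Item struct {
	Labels   labels.Labels
	hashFunc func(labels.Labels) ItemHash
}

// Hash delegates to the injected function when one is set.
func (t *Item) Hash() ItemHash {
	if t.hashFunc != nil {
		return t.hashFunc(t.Labels)
	}
	return ItemHash(t.Labels.Hash())
}
```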

Contributor

@jaronoff97 jaronoff97 left a comment


I don't love needing to set a function like this; I propose another idea that may be worth testing.

@mike9421 mike9421 force-pushed the bugfix/potential_duplication branch from 4ea9aa1 to 09d28a4 on May 18, 2025 at 18:54
mike9421 and others added 3 commits July 26, 2025 00:42
…fault; dropping it is not a recommended practice

2. The Prometheus receiver in OpenTelemetry relies on 'job', and dropping it causes errors
@mike9421
Author

mike9421 commented Aug 2, 2025

@swiatekm @jaronoff97 Hi, could you please help review this PR when you have time and check if any updates are needed?

This PR:

  • Updates the scrape target labels to align with Prometheus-processed labels
  • Removes relabel_configs
  • Adds unit tests comparing the behavior with Prometheus

It’s a bit challenging to fully decouple these unit tests from Prometheus.
Thanks in advance for your review! 🙏

@eenchev

eenchev commented Sep 9, 2025

@swiatekm @jaronoff97
This issue is rendering the Target Allocator unusable for us as well. Could you please review this PR or propose a workaround in the meantime?

Contributor

@jaronoff97 jaronoff97 left a comment


I think this should work. Do we have any existing end-to-end tests that we can update or view to verify this is working as expected? Thank you for your patience 🙇

Contributor

github-actions bot commented Sep 9, 2025

E2E Test Results

 34 files  ±0  223 suites  +2   1h 59m 49s ⏱️ + 1m 51s
 89 tests +1   89 ✅ +1  0 💤 ±0  0 ❌ ±0 
227 runs  +2  227 ✅ +2  0 💤 ±0  0 ❌ ±0 

Results for commit 42078f2. Comparison against base commit e5c048b.

♻️ This comment has been updated with latest results.

@mike9421
Author

I think this should work. Do we have any existing end-to-end tests that we can update or view to verify this is working as expected? Thank you for your patience 🙇

Yes, the PR includes unit tests for the changes. I'll be fixing the linting issues and also adding some end-to-end tests to verify this over the next couple of days. Thank you!


linux-foundation-easycla bot commented Sep 15, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@mike9421 mike9421 force-pushed the bugfix/potential_duplication branch from 2fad63d to b715b56 on September 15, 2025 at 19:36
@mike9421
Author

mike9421 commented Sep 15, 2025

Hi @jaronoff97 @swiatekm , could you please help review this PR when you have time? Here's a summary of the changes:

  • Fixed golint issues.

  • Removed incorrect assert statements.

  • Added end-to-end (e2e) tests.

mike9421 and others added 3 commits September 16, 2025 16:37
…l to prevent failures caused by premature detection results
- Fixed target.GetNodeName() for per-node strategy
- Simplified duplicate target detection logic
- Updated unit tests to verify new behavior
@pandoralink

Hi @jaronoff97 @swiatekm , could you please help review this PR when you have time? Here's a summary of the changes:

  • Fixed golint issues.
  • Removed incorrect assert statements.
  • Added end-to-end (e2e) tests.

@jaronoff97 @swiatekm
This issue affects our Target Allocator usage too. Could you please review this PR or propose a workaround?

Contributor

@swiatekm swiatekm left a comment


This change looks broadly correct to me. My main issue with it is that we'd be changing how relabeling works in the target allocator for all users, without the ability for them to change it back in the event that the change has unforeseen consequences or bugs. For me to approve this, it needs to be configurable, either via a feature flag or configuration option.

In the interest of moving things along, how about we do the following:

  1. You extract the hash calculation changes into a separate PR. These are a straightforward bug fix, so we can merge them quickly, and removing them from this PR will make it easier to review.
  2. In this PR, we remove the e2e tests, and make the new behavior contingent on a new config option which will remain hidden and only used in unit tests.
  3. In a separate PR, we expose the new config option in the TargetAllocator CRD and add the e2e tests using it.

How does that sound? I know I'm asking for additional work, but I really don't want to make this change haphazardly.
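
For context, the hidden gate proposed in step 2 could look roughly like the following (a sketch only; the field name and where it lives are assumptions, not the actual configuration surface):

```go
package config

// Config sketches a target allocator config with a gate that is not exposed
// in the CRD yet and defaults to the existing behavior, so the new
// relabeling path stays opt-in until a follow-up PR surfaces it.
type Config struct {
	// ...existing fields elided in this sketch...

	// RelabelTargetsInAllocator applies full relabeling to target labels
	// inside the target allocator. Hidden for now: set only from unit tests.
	RelabelTargetsInAllocator bool `yaml:"-" json:"-"`
}
```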

Comment on lines +79 to +82

    # Ensure metrics are flushed to the exporter at least once
    batch:
      send_batch_size: 1000
      timeout: 20s
Contributor


If we want to guarantee a flush, isn't it better to set a very low timeout? Or just not use the batch processor at all.

Author


In this test, a single collector needs to scrape metrics from multiple Prometheus targets, and different targets have varying scrape completion times. The verification script needs to wait until all target metrics are collected before performing consistency validation, so we use the batch processor to ensure data completeness. Setting a very low timeout could result in the verification only getting partial target data.

P.S. My previous comment was indeed not clear enough 😅

Contributor


I think it'd be simpler to just have the verification script poll periodically and check if all the data is available. You can't really guarantee completeness with what you're doing now, and the test will likely be flaky on slow GHA runners.
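
The polling approach could look roughly like the following (shown in Go for brevity; the PR's actual verification script is Python inside the ConfigMap below, so the function and endpoint here are illustrative assumptions):

```go
package e2e

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// waitForSeries polls a Prometheus-format metrics endpoint until every
// expected series name appears, instead of relying on batch-processor timing.
func waitForSeries(url string, expected []string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if allPresent(url, expected) {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("timed out waiting for all expected metrics at %s", url)
}

func allPresent(url string, expected []string) bool {
	resp, err := http.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false
	}
	for _, name := range expected {
		if !strings.Contains(string(body), name) {
			return false
		}
	}
	return true
}
```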

metadata:
  name: metrics-consistency-script
data:
  main.py: |
Contributor

@swiatekm swiatekm Sep 22, 2025


I wish this was achievable using only chainsaw's Prometheus metric parsing function. Or, if we do need the script, I'd prefer it live in a .py file in the tests directory and be imported here somehow. Python code as text inside a ConfigMap definition is highly unmaintainable.

Author


You're right. I will split it into separate files and look for simpler approaches than the current validation script.

Comment on lines +45 to +53
// These labels are typically required for correct scraping behavior and are expected to be retained after relabeling:
// - job
// - __scrape_interval__
// - __scrape_timeout__
// - __scheme__
// - __metrics_path__
// Prometheus adds these labels by default. Removing them via relabel_configs is considered invalid and is therefore ignored.
// For details, see:
// https://github.com/prometheus/prometheus/blob/e6cfa720fbe6280153fab13090a483dbd40bece3/scrape/target.go#L429
Contributor


Is the intent of this comment to explain what relabel.Process does? It doesn't have anything to do with the code in this package.

Author


Thanks for the feedback. You're right that this comment is confusing in its current placement.

The intent of this comment was to document a behavioral difference between Prometheus and the target allocator regarding relabeling of these specific labels. In Prometheus, attempting to remove these labels via relabel_configs would cause an error, while the target allocator allows it and can still scrape normally.

However, since this behavioral difference doesn't actually impact functionality (the target allocator works fine either way), and the comment is indeed misleading about what the local code does, I'll remove it in a follow-up change.

The key point was just that these labels shouldn't be dropped during relabeling to maintain consistency with Prometheus behavior, but since it doesn't break anything when they are dropped, the comment adds more confusion than value.
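
To make the behavioral difference concrete, a post-relabel check along the lines of what Prometheus enforces (per the target.go link above) might look like this; it is a sketch, not code from the PR or from Prometheus:

```go
package target

import (
	"fmt"

	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/labels"
)

// requiredAfterRelabel lists the labels Prometheus expects to survive
// relabeling; dropping them is treated as an invalid configuration there.
var requiredAfterRelabel = []string{
	model.JobLabel,            // "job"
	model.ScrapeIntervalLabel, // "__scrape_interval__"
	model.ScrapeTimeoutLabel,  // "__scrape_timeout__"
	model.SchemeLabel,         // "__scheme__"
	model.MetricsPathLabel,    // "__metrics_path__"
}

// validateRelabeled reports which required label, if any, was removed.
func validateRelabeled(lbls labels.Labels) error {
	for _, name := range requiredAfterRelabel {
		if lbls.Get(name) == "" {
			return fmt.Errorf("relabeling removed required label %q", name)
		}
	}
	return nil
}
```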

@@ -0,0 +1,168 @@
// Copyright The OpenTelemetry Authors
Contributor


Why can't this code live in relabel_test.go?

Author


Good point! You're right - since these functions are only used by relabel_test.go, they should live there instead of in a separate utility file. I'll move them over in the next update.

Successfully merging this pull request may close these issues.

Targets in the same group are assigned to the same OTel