
Conversation

mike9421

Description:

There may be a problem of unexpected target duplication -- for example, when users rewrite the target address (__address__) during relabeling.

Link to tracking Issue(s):

Testing:
Added a TestDiscovery_NewItem unit test and adjusted relabel_test.go to accommodate the new test.

@mike9421 mike9421 requested a review from a team as a code owner February 23, 2025 16:51
@mike9421
Author

@swiatekm Hello, I've submitted this pull request. Could you please review it at your convenience? Thank you!

@swiatekm
Contributor

This approach seems wasteful, as you're applying the relabeling step twice. It also circumvents current component responsibility boundaries, as we'd like all relabeling to happen in the prehook.

Instead, I think it'd be much simpler to defer calculating the target url. We only need the url during target allocation, at which point relabeling is guaranteed to be done. What I'd do:

  1. Make the TargetURL attribute private and start empty.
  2. Add a GetTargetURL method that returns the attribute value if set, otherwise checks the label and sets it.
  3. Since this attribute is serialized into JSON by the web server, you might need to explicitly set a tag on it.
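
A minimal sketch of the lazy accessor being suggested, assuming a simplified Item type (the names below are illustrative, not the allocator's actual code):

```go
package target

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/labels"
)

// Item is a simplified stand-in for the allocator's target item; field and
// method names here are illustrative only.
type Item struct {
	Labels    labels.Labels
	targetURL string // private; computed lazily, after relabeling is done
}

// GetTargetURL returns the cached URL if already set, otherwise derives it
// from the (relabeled) __address__ label and caches it.
func (t *Item) GetTargetURL() string {
	if t.targetURL == "" {
		t.targetURL = t.Labels.Get(model.AddressLabel)
	}
	return t.targetURL
}
```

Note that encoding/json does not serialize unexported fields, so keeping the URL in the web server's JSON output would need an exported mirror field or a custom marshaller rather than only a struct tag.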

@mike9421
Author

mike9421 commented Mar 2, 2025

The general processing flow of "Targets" is as follows:
Discovery -> NewItem -> SetTargets -> prehook -> assign

My considerations are as follows:

  • If deduplication is not performed before assigning targets, duplicate targets may still be scraped. The consistent-hashing strategy can ensure that duplicates are assigned to the same OTel instance by hashing the TargetURL after relabeling, but the least-weighted strategy cannot guarantee assignment to the same OTel instance (see the sketch after this list).
  • The prehook is not currently required to be enabled, so in this PR all relabel operations have been moved into NewItem, and the RelabeledKeep field is saved for relabelConfigTargetFilter.
  • After deduplication, the TargetURL used by consistent-hashing could have remained unchanged. However, I later found that the targets returned by Prometheus are unordered (they are collected by iterating over a map in allGroups), which means the target selected after deduplication may vary each time. Therefore, I added the TargetURLRelabeled field for consistent-hashing to hash on. This ensures that even if a different duplicate is selected, it will still be assigned to the same OTel instance.
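
A rough, self-contained sketch of the difference described in the first bullet (illustrative only; this is not the allocator's actual assignment code):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// pickByHash chooses a collector purely from the relabeled address, so two
// duplicates of the same target always land on the same collector.
func pickByHash(relabeledAddr string, collectors []string) string {
	h := fnv.New64a()
	h.Write([]byte(relabeledAddr))
	return collectors[h.Sum64()%uint64(len(collectors))]
}

// pickLeastWeighted chooses whichever collector currently has the fewest
// targets; two duplicates seen at different times can land on different ones.
func pickLeastWeighted(load map[string]int) string {
	best, bestLoad := "", math.MaxInt
	for c, n := range load {
		if n < bestLoad {
			best, bestLoad = c, n
		}
	}
	load[best]++
	return best
}

func main() {
	collectors := []string{"otel-0", "otel-1"}
	// Two discovered entries that relabel to the same address.
	fmt.Println(pickByHash("10.0.0.5:9100", collectors)) // always the same collector
	fmt.Println(pickByHash("10.0.0.5:9100", collectors))

	load := map[string]int{"otel-0": 0, "otel-1": 0}
	fmt.Println(pickLeastWeighted(load)) // first copy goes to the emptier collector
	fmt.Println(pickLeastWeighted(load)) // the duplicate may then go to the other one
}
```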

This approach seems wasteful, as you're applying the relabeling step twice. It also circumvents current component responsibility boundaries, as we'd like all relabeling to happen in the prehook.

Instead, I think it'd be much simpler to defer calculating the target url. We only need the url during target allocation, at which point relabeling is guaranteed to be done. What I'd do:

  1. Make the TargetURL attribute private and start empty.
  2. Add a GetTargetURL method that returns the attribute value if set, otherwise checks the label and sets it.
  3. Since this attribute is serialized into JSON by the web server, you might need to explicitly set a tag on it.

@swiatekm Modifying the TargetURL alone cannot solve the potential issue of the least-weighted strategy scraping duplicate targets. Additionally, since the prometheusreceiver will perform relabeling again, the TargetURL provided to OTel must be the original TargetURL (i.e., the original __address__).

However, the above PR does circumvent the component responsibility boundaries, and placing the relabel operation in the prehook is indeed more reasonable. Since the deduplication operation is very similar to the target discarding operation, I think the deduplication operation can be placed in the default filter (i.e., relabelConfigTargetFilter) first. Afterward, we may need to consider whether multiple prehooks are necessary or if there should be a default prehook. I'd love to hear your thoughts on this.

@swiatekm
Contributor

swiatekm commented Mar 3, 2025

Alright, I see what the problem is. What we need to defer here is not just determining the target URL, but also the hash. Is that right @mike9421?

If that's the case, could you hold off on this change for a bit? I have a WIP change that accomplishes the same thing, albeit for different reasons. It is much simpler than what you have here, so I'll submit it and we can see if it addresses your issue. Does that sound ok?

@mike9421
Author

mike9421 commented Mar 6, 2025

Alright, I see what the problem is. What we need to defer here is not just determining the target URL, but also the hash. Is that right @mike9421?

If that's the case, could you hold off on this change for a bit? I have a WIP change that accomplishes the same thing, albeit for different reasons. It is much simpler than what you have here, so I'll submit it and we can see if it addresses your issue. Does that sound ok?

Alright, this PR will be put on hold until your WIP is completed. Which PR corresponds to the WIP? I'd like to see if it resolves the issue.

@swiatekm
Contributor

swiatekm commented Mar 6, 2025

#3777 is the PR. It doesn't resolve the issue, but it will make it much simpler to do so. I think just calculating the Hash and TargetUrl forcefully in the relabel prehook should solve it.
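
A hedged sketch of that idea; the Item type and field names are simplified stand-ins rather than the allocator's real API, while relabel.Process is the upstream Prometheus helper:

```go
package prehook

import (
	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/model/relabel"
)

// Item is a simplified stand-in for the allocator's target item.
type Item struct {
	Labels    labels.Labels
	targetURL string
	hash      uint64
}

// applyRelabelPrehook relabels every target and forcefully recomputes the
// derived fields, so deduplication and allocation only ever see values
// computed from the post-relabel label set.
func applyRelabelPrehook(items []*Item, cfgs []*relabel.Config) []*Item {
	kept := items[:0]
	for _, it := range items {
		lbls, keep := relabel.Process(it.Labels, cfgs...)
		if !keep {
			continue // target dropped by relabeling
		}
		it.Labels = lbls
		it.targetURL = lbls.Get(model.AddressLabel) // re-derive the scrape URL
		it.hash = lbls.Hash()                       // re-hash over relabeled labels
		kept = append(kept, it)
	}
	return kept
}
```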

@mike9421
Author

mike9421 commented Mar 8, 2025

I think just calculating the Hash and TargetUrl forcefully in the relabel prehook should solve it.

Yes, a forced relabel operation is required. As I mentioned above, since the prehook is not mandatory (in the current design), the following issue may need to be considered:

Afterward, we may need to consider whether multiple prehooks are necessary or if there should be a default prehook.

@swiatekm
Contributor

swiatekm commented Mar 8, 2025

Yes, a forced relabel operation is required. As I mentioned above, since the prehook is not mandatory (in the current design), the following issue may need to be considered:

Yes, we might need to force it, or at least document that target duplication is possible when it's disabled. Right now it's enabled by default, so it wouldn't be too drastic of a change at least.

More broadly, we do want to look into doing all target relabeling in the target allocator (right now we just drop targets, but don't change the labels themselves). This is more complicated, but it would make us more compatible with Prometheus and resolve these duplication issues.

@mike9421 mike9421 force-pushed the bugfix/potential_duplication branch from 758630b to e03a841 on April 14, 2025 at 17:03
@mike9421
Author

The updated PR enables passing the required hash value and enhances extensibility, allowing different deduplication strategies by using different hash calculation methods for each FilterStrategy. For now, I haven’t found a better approach for passing the relabeled data to func (t *Item) Hash() ItemHash.
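
For illustration, the kind of pluggable hash being described might look roughly like this (a sketch only; the hashFunc field is an assumption for illustration, not the PR's actual mechanism):

```go
package target

import "github.com/prometheus/prometheus/model/labels"

type ItemHash uint64

// Item carries a strategy-specific hash function: a relabel-aware filter can
// inject one that hashes the relabeled label set, while other strategies keep
// the default hash over the original labels.
type Item struct {
	Labels   labels.Labels
	hashFunc func(labels.Labels) ItemHash
}

// Hash delegates to the injected function when one is set.
func (t *Item) Hash() ItemHash {
	if t.hashFunc != nil {
		return t.hashFunc(t.Labels)
	}
	return ItemHash(t.Labels.Hash())
}
```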

Contributor

@jaronoff97 jaronoff97 left a comment


I don't love needing to set a function like this; I propose another idea that may be worth testing.

@mike9421 mike9421 force-pushed the bugfix/potential_duplication branch from 4ea9aa1 to 09d28a4 on May 18, 2025 at 18:54
mike9421 and others added 3 commits July 26, 2025 00:42
…fault; dropping it is not a recommended practice

2. The Prometheus receiver in OpenTelemetry relies on 'job', and dropping it causes errors
@mike9421
Author

mike9421 commented Aug 2, 2025

@swiatekm @jaronoff97 Hi, could you please help review this PR when you have time and check if any updates are needed?

This PR:

  • Updates the scrape target labels to align with Prometheus-processed labels
  • Removes relabel_configs
  • Adds unit tests comparing the behavior with Prometheus

It’s a bit challenging to fully decouple these unit tests from Prometheus.
Thanks in advance for your review! 🙏

@eenchev

eenchev commented Sep 9, 2025

@swiatekm @jaronoff97
This issue is rendering the Target Allocator unusable for us as well. Could you please review this PR or propose a workaround in the meantime?

Contributor

@jaronoff97 jaronoff97 left a comment


I think this should work. Do we have any existing end-to-end tests that we can update or view to verify this is working as expected? Thank you for your patience 🙇

Contributor

github-actions bot commented Sep 9, 2025

E2E Test Results

 34 files  ±0  223 suites  +2   1h 59m 49s ⏱️ + 1m 51s
 89 tests +1   89 ✅ +1  0 💤 ±0  0 ❌ ±0 
227 runs  +2  227 ✅ +2  0 💤 ±0  0 ❌ ±0 

Results for commit 42078f2. Comparison against base commit e5c048b.

♻️ This comment has been updated with latest results.

@mike9421
Author

I think this should work. Do we have any existing end-to-end tests that we can update or view to verify this is working as expected? Thank you for your patience 🙇

Yes, the PR includes unit tests for the changes. I'll be fixing the linting issues and also adding some end-to-end tests to verify this over the next couple of days. Thank you!


linux-foundation-easycla bot commented Sep 15, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@mike9421 mike9421 force-pushed the bugfix/potential_duplication branch from 2fad63d to b715b56 on September 15, 2025 at 19:36
@mike9421
Author

mike9421 commented Sep 15, 2025

Hi @jaronoff97 @swiatekm , could you please help review this PR when you have time? Here's a summary of the changes:

  • Fixed golint issues.

  • Removed incorrect assert statements.

  • Added end-to-end (e2e) tests.

mike9421 and others added 3 commits September 16, 2025 16:37
…l to prevent failures caused by premature detection results
- Fixed target.GetNodeName() for per-node strategy
- Simplified duplicate target detection logic
- Updated unit tests to verify new behavior
@pandoralink

Hi @jaronoff97 @swiatekm , could you please help review this PR when you have time? Here's a summary of the changes:

  • Fixed golint issues.
  • Removed incorrect assert statements.
  • Added end-to-end (e2e) tests.

@jaronoff97 @swiatekm
This issue affects our Target Allocator usage too. Could you please review this PR or propose a workaround?

Contributor

@swiatekm swiatekm left a comment


This change looks broadly correct to me. My main issue with it is that we'd be changing how relabeling works in the target allocator for all users, without the ability for them to change it back in the event that the change has unforeseen consequences or bugs. For me to approve this, it needs to be configurable, either via a feature flag or configuration option.

In the interest of moving things along, how about we do the following:

  1. You extract the hash calculation changes into a separate PR. These are a straightforward bug fix, so we can merge them quickly, and removing them from this PR will make it easier to review.
  2. In this PR, we remove the e2e tests, and make the new behavior contingent on a new config option which will remain hidden and only used in unit tests.
  3. In a separate PR, we expose the new config option in the TargetAllocator CRD and add the e2e tests using it.

How does that sound? I know I'm asking for additional work, but I really don't want to make this change haphazardly.
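
For context, the hidden gate proposed in step 2 could look roughly like the following (a sketch only; the field name and where it lives are assumptions, not the actual configuration surface):

```go
package config

// Config sketches a target allocator config with a gate that is not exposed
// in the CRD yet and defaults to the existing behavior, so the new
// relabeling path stays opt-in until a follow-up PR surfaces it.
type Config struct {
	// ...existing fields elided in this sketch...

	// RelabelTargetsInAllocator applies full relabeling to target labels
	// inside the target allocator. Hidden for now: set only from unit tests.
	RelabelTargetsInAllocator bool `yaml:"-" json:"-"`
}
```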

Comment on lines +79 to +82

    # Ensure metrics are flushed to the exporter at least once
    batch:
      send_batch_size: 1000
      timeout: 20s
Contributor


If we want to guarantee a flush, isn't it better to set a very low timeout? Or just not use the batch processor at all.

Author


In this test, a single collector needs to scrape metrics from multiple Prometheus targets, and different targets have varying scrape completion times. The verification script needs to wait until all target metrics are collected before performing consistency validation, so we use the batch processor to ensure data completeness. Setting a very low timeout could result in the verification only getting partial target data.

P.S. My previous comment was indeed not clear enough 😅

Contributor


I think it'd be simpler to just have the verification script poll periodically and check if all the data is available. You can't really guarantee completeness with what you're doing now, and the test will likely be flaky on slow GHA runners.
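
The polling approach could look roughly like the following (shown in Go for brevity; the PR's actual verification script is Python inside the ConfigMap below, so the function and endpoint here are illustrative assumptions):

```go
package e2e

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// waitForSeries polls a Prometheus-format metrics endpoint until every
// expected series name appears, instead of relying on batch-processor timing.
func waitForSeries(url string, expected []string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if allPresent(url, expected) {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("timed out waiting for all expected metrics at %s", url)
}

func allPresent(url string, expected []string) bool {
	resp, err := http.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false
	}
	for _, name := range expected {
		if !strings.Contains(string(body), name) {
			return false
		}
	}
	return true
}
```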

metadata:
  name: metrics-consistency-script
data:
  main.py: |
Contributor

@swiatekm swiatekm Sep 22, 2025


I wish this was achievable using only chainsaw's Prometheus metric parsing function. Or, if we do need the script, I'd prefer it live in a .py file in the tests directory and be imported here somehow. Python code as text inside a ConfigMap definition is highly unmaintainable.

Author


You're right. I will split it into separate files and look for simpler approaches than the current validation script.

Comment on lines +45 to +53
// These labels are typically required for correct scraping behavior and are expected to be retained after relabeling:
// - job
// - __scrape_interval__
// - __scrape_timeout__
// - __scheme__
// - __metrics_path__
// Prometheus adds these labels by default. Removing them via relabel_configs is considered invalid and is therefore ignored.
// For details, see:
// https://github.com/prometheus/prometheus/blob/e6cfa720fbe6280153fab13090a483dbd40bece3/scrape/target.go#L429
Contributor


Is the intent of this comment to explain what relabel.Process does? It doesn't have anything to do with the code in this package.

Author


Thanks for the feedback. You're right that this comment is confusing in its current placement.

The intent of this comment was to document a behavioral difference between Prometheus and the target allocator regarding relabeling of these specific labels. In Prometheus, attempting to remove these labels via relabel_configs would cause an error, while the target allocator allows it and can still scrape normally.

However, since this behavioral difference doesn't actually impact functionality (the target allocator works fine either way), and the comment is indeed misleading about what the local code does, I'll remove it in a follow-up change.

The key point was just that these labels shouldn't be dropped during relabeling to maintain consistency with Prometheus behavior, but since it doesn't break anything when they are dropped, the comment adds more confusion than value.
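
To make the behavioral difference concrete, a post-relabel check along the lines of what Prometheus enforces (per the target.go link above) might look like this; it is a sketch, not code from the PR or from Prometheus:

```go
package target

import (
	"fmt"

	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/model/labels"
)

// requiredAfterRelabel lists the labels Prometheus expects to survive
// relabeling; dropping them is treated as an invalid configuration there.
var requiredAfterRelabel = []string{
	model.JobLabel,            // "job"
	model.ScrapeIntervalLabel, // "__scrape_interval__"
	model.ScrapeTimeoutLabel,  // "__scrape_timeout__"
	model.SchemeLabel,         // "__scheme__"
	model.MetricsPathLabel,    // "__metrics_path__"
}

// validateRelabeled reports which required label, if any, was removed.
func validateRelabeled(lbls labels.Labels) error {
	for _, name := range requiredAfterRelabel {
		if lbls.Get(name) == "" {
			return fmt.Errorf("relabeling removed required label %q", name)
		}
	}
	return nil
}
```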

@@ -0,0 +1,168 @@
// Copyright The OpenTelemetry Authors
Contributor


Why can't this code live in relabel_test.go?

Author


Good point! You're right - since these functions are only used by relabel_test.go, they should live there instead of in a separate utility file. I'll move them over in the next update.

Successfully merging this pull request may close these issues.

Targets in the same group are assigned to the same OTel