Drowze
Contributor

@Drowze Drowze commented Sep 1, 2025

What does this PR do?

In the Karafka integration, when distributed_tracing is enabled, the original trace is lost after iterating through the messages. This PR fixes that and also adds span links between each message trace and the parent consumer trace, so that each is easier to find from the other in the Datadog APM UI.
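
A rough model of the behaviour described above may help when reading the diff. This is only a sketch: `ToyTracer`, `consume`, the `fixed:` flag, and the `'message.consume'` and `producer-N` names are hypothetical stand-ins and not the dd-trace-rb API. It assumes the bug comes from replacing the active trace for each message without restoring it, and that scoping the continued trace to the message block avoids this.

```ruby
# Hypothetical stand-in for a tracer; NOT the real dd-trace-rb classes.
# Traces are modelled as plain strings to keep the sketch self-contained.
class ToyTracer
  attr_reader :active_trace

  def initialize
    @active_trace = nil
  end

  # Runs the block inside a trace and restores the previous trace afterwards.
  # With continue_from, the block runs inside the given (remote) trace instead.
  def trace(name, continue_from: nil)
    previous = @active_trace
    @active_trace = continue_from || previous || name
    yield @active_trace
  ensure
    @active_trace = previous
  end

  # Replaces the active trace with no restore step (models the reported bug).
  def continue_trace!(digest)
    @active_trace = digest
  end
end

# Returns whichever trace is active once every message has been processed.
def consume(tracer, digests, fixed:)
  tracer.trace('worker.process') do
    digests.each do |digest|
      if fixed
        # The continued trace is scoped to the block, so the parent comes back.
        tracer.trace('message.consume', continue_from: digest) { }
      else
        # The continued trace stays active after the block ends.
        tracer.continue_trace!(digest)
        tracer.trace('message.consume') { }
      end
    end
    tracer.active_trace
  end
end

digests = %w[producer-1 producer-2]
puts consume(ToyTracer.new, digests, fixed: false) # prints "producer-2"
puts consume(ToyTracer.new, digests, fixed: true)  # prints "worker.process"
```

In this toy model the unscoped variant leaves the last message's trace active after the loop, while the block-scoped variant returns to `worker.process`.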

Fixes #4873

Motivation:
Fix and improve the Datadog Karafka support.

Change log entry

Additional Notes:

How to test the change?
See #4873 for a full testing snippet.

@Drowze Drowze requested review from a team as code owners September 1, 2025 13:33
@github-actions github-actions bot added the integrations (Involves tracing integrations) and tracing labels Sep 1, 2025
@Drowze Drowze changed the title from "Fix missing original karafka trace after iterating" to "Fix losing original karafka trace after iterating through the messages with distributed_tracing on" Sep 1, 2025
Comment on lines +50 to +58
if parent_span && span.parent_id != parent_span.id
  # add a link from the parent trace to the message span
  span_link = Tracing::SpanLink.new(parent_trace_digest)
  span.links << span_link

  # add a link from the current trace to the parent span
  span_link = Tracing::SpanLink.new(trace.to_digest)
  parent_span.links << span_link
end
Contributor Author

@Drowze Drowze Sep 1, 2025

Please let me know your thoughts here! I understand that "span links" are not very widely used in the Ruby SDK, but one challenge we have once we continue a trace is correlating the span with its "original parent". I think span links are a great fit for solving that problem.

@Drowze
Contributor Author

Drowze commented Sep 15, 2025

Hey @marcotc @ivoanjo can you have a quick look here please? 🙇

I'd like to get this merged so that it doesn't conflict with other Karafka/WaterDrop pull requests (e.g. #4874). I also want to open a new PR soon about turning off distributed tracing on a per-topic basis (see here).

@ivoanjo
Member

ivoanjo commented Sep 16, 2025

Err thanks for the patience, I've asked the team to take a look at this asap :)

@marcotc
Member

marcotc commented Sep 17, 2025

@Drowze I took a first pass and I need to allocate proper time to understand how we best want to present this kind of workflow (span links, directly linking, something else).

I have scheduled time to take a good look at this next week.

@p-datadog
Member

The repair to #4876 LGTM. Span links do not appear to be used in dd-trace-rb except in one place, where an OpenTelemetry trace is converted (mapped?) to a Datadog trace. @Drowze how do you feel about moving the span link code into a separate PR and leaving just the fix in this PR?

@p-datadog
Member

Regarding span links: I read https://docs.datadoghq.com/tracing/trace_collection/span_links/ to try to understand their usage. The documentation page says that span links are for relationships other than parent/child. In this PR, are they being used for a parent/child relationship? What is the actual gain/benefit?

end

- Tracing.trace(Ext::SPAN_MESSAGE_CONSUME) do |span|
+ Tracing.trace(Ext::SPAN_MESSAGE_CONSUME, continue_from: trace_digest) do |span, trace|
Member

Actually, I am not sure this is correct. Is continue_from really meant for a parent/child relationship?

@p-datadog
Member

Looking at this PR again I don't understand the fix. The problem is the parent trace gets reset somehow, therefore I would expect the fix to repair the reset. How is adding links fixing the reset?

Labels
integrations Involves tracing integrations tracing
Development

Successfully merging this pull request may close these issues.

[BUG]: Karafka's "worker.process" trace is lost after iterating the messages (when distributed tracing is enabled)
5 participants