Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve correlation impact for the switch from span events to (log) events in GenAI #1294

Closed
codefromthecrypt opened this issue Jul 29, 2024 · 9 comments

Comments

@codefromthecrypt
Copy link

Area(s)

area:gen-ai

Is your change request related to a problem? Please describe.

#980 prefers the (log) events api for prompt/completion data in cases including chat and tool operations in GenAI.

Formerly, this was done in span events, which have a direct causality link (of trace/spanID) regardless of whether a correlation genai scoped attribute was added to each event. While there are good reasons to use the (log) events api, there is no explicit link anymore, so it leaves this up to instrumentation to choose to add a correlating attribute or a link.

We don't yet have direct advice on how to handle this transition, even though in practice it won't manifest for months or longer. For example, python is the dominant language in GenAI and it lacks the Event/Events api for the time being. This means many will use the compatibility mode (span events) and so not even see this issue until this status quo changes.

Describe the solution you'd like

I would like for us to offer guidance either in the events document for genai, or link to a practices doc defined at a higher level. This can help avoid a situation where users need to use temporal correlation or employ backend tricks to stitch-together the (log) events caused by a span. It would avoid this, as implementors can adopt a strategy before the experience miss manifests in practice.

Describe alternatives you've considered

Leaving this out of scope is ok for experts who also have context in how this is designed. They can choose to use practices such as span links or find matching correlation IDs. However, this doesn't help the backends achieve something consistently.

For example, such special casing may work for a vendor or couple based on tacit practice sharing. The goal of the semantics, as I understand it, it to allow both sides of the spec, instrumentors and those collecting the data, to have coherency and consistency. Some stronger advice, particularly when spans are in use, would be helpful, even if in the future we may see (log) event only or (log) event + metrics only implementations.

Additional context

link discussing this directly #980 (comment)

@karthikscale3
Copy link
Contributor

Thanks for bringing this up. This definitely needs some guidance. Adding it to log events will break the link with the span. This is crucial because today, LLM observability focused vendors such as Langtrace, Traceloop etc. offer a critical feature as part of the SDK for recording user feedback on model responses as scores which is attached as an attributed to a new span and linked to the parent span of the trace where the span corresponding to the LLM api call was generated. This is further shown in the UI to understand the score given by the user for a particular LLM generation.

Without a reliable way to link spans with logs, this feature is at risk of being impacted which means migrating to logs will be a blocker for vendor instrumentations.

@lmolkova
Copy link
Contributor

/cc @open-telemetry/semconv-event-approvers

@trask
Copy link
Member

trask commented Jul 29, 2024

Events (and logs) do have an optional reference to a parent span (in the proto, spec, and API). Is this asking for something different from what exists already? Thanks

@lmolkova
Copy link
Contributor

lmolkova commented Jul 29, 2024

If I understand the concern:

Is the question here is whether it's still reasonable to have incoherence during the experimentation phase?

@codefromthecrypt
Copy link
Author

codefromthecrypt commented Jul 30, 2024

Events (and logs) do have an optional reference to a parent span (in the proto, spec, and API). Is this asking for something different from what exists already? Thanks

So, one ask I had to resolve this issue is "link to a practices doc defined at a higher level" I may be looking at the wrong docs. I was looking at the following:

So, I am looking at the wrong docs, if it is already mentioned about the transition from span events to (log) event API. specifically the correlation concern with spans. It didn't seem obvious to me this is already the case. Can you point me to what I missed?

@codefromthecrypt
Copy link
Author

Is the question here is whether it's still reasonable to have incoherence during the experimentation phase?

My goal wasn't about the abstract acceptance of incoherence, rather the specific case of transition from span events to log events, and to help folks implement that properly. The PR #980 can go in anyway, hence me forking this as a separate concern to follow-up on vs suggest that PR should not go in.

To paraphrase the description, I was asking for either practical concerns in the genAI text, or in a top-level text about the event/s API wrt span correlation. I'm told this already exists for event/s API, but I can't find it yet.

Once we find/create this content, I can make a follow-up PR to enhance the spec with either a link to that advice or inline if there isn't one.

Meanwhile, most instrumentation won't hit this anyway as python has no event/s api yet. That said, if folks implement this in in javascript/typescript they would have access to the Event/s api.
https://github.com/open-telemetry/opentelemetry-js/tree/main/experimental/packages/api-events

Both for other folks like @karthikscale3 and myself, I really just want concrete guidance. We are near, but not done yet. This is independent of community acceptance of things that break as we attempt to move things forward in an experimental SIG.

Hope this helps

@codefromthecrypt
Copy link
Author

ps as I think it is easier to do vs suggest, I'll raise a PR to the spec doc as I feel folks aren't getting my intent. It isn't if we mention there is a context field or not in docs. Rather, the impact of moving from span events, and that if you do that and don't use that field, you will not achieve the same correlation. I think adding this remark directly is easier than what I've attempted so far 🤞

@codefromthecrypt
Copy link
Author

here's what I was looking for, somewhere. I hope it clarifies my intent. Once we have something like this, we can link to it in GenAI and any other new spec that migrates from span events to (log) events without people having to think through the nuance of this topic. I'm happy to expand, etc. this is small just to get things started.

open-telemetry/opentelemetry-specification#4167

@codefromthecrypt
Copy link
Author

I'm going to close this out and re-open it when the topic leads to a concrete problem. I think that's a better use of our time vs trying to prevent one with a doc change. Thanks for the feedback @karthikscale3 @trask @lmolkova

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

No branches or pull requests

5 participants