GenAI (LLM): how to capture streaming #1170

Open
lmolkova opened this issue Jun 20, 2024 · 3 comments

Comments

@lmolkova
Contributor

lmolkova commented Jun 20, 2024

Some questions (and proposals) on capturing streaming LLM completions:

  1. Should the GenAI span cover the duration until the last token in case of streaming?
    • Yes, otherwise how do we capture the completion, errors, usage, etc.? (A rough sketch of what this could look like follows this list.)
  2. Do we need an event when the first token arrives? Or another span to capture the time from the start of the call to the first token?
    • This might be too verbose and not particularly useful.
  3. Do we need some indication on the span that it represents a streaming call?
  4. Do we need new metrics?
    • See Add LLM model server metrics #1103 for server streaming metrics:
      • Time-to-first-token
      • Time-to-next-token
      • The number of active streams would also be useful: streaming seems to be quite hard and error-prone, and users would appreciate knowing when they don't close streams, don't read them to the end, etc.
  5. What should gen_ai.client.operation.duration capture?
    • Same as the span: time-to-last-token.
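
A minimal sketch of how 1 and 2 could look in a client instrumentation, assuming a Python client whose streaming API yields chunks; the `client.chat(...)` call, the event name, and the attribute keys beyond the standard `gen_ai.*` ones are illustrative, not part of the conventions:

```python
import time

from opentelemetry import trace

tracer = trace.get_tracer("example.genai.instrumentation")

def stream_chat_completion(client, model, **kwargs):
    """Wrap a streaming chat call; the span only ends after the last chunk (question 1)."""
    span = tracer.start_span(f"chat {model}")
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", model)
    start = time.monotonic()
    first_chunk_seen = False
    try:
        # `client.chat(...)` is a placeholder for whatever streaming API the SDK exposes.
        for chunk in client.chat(model=model, stream=True, **kwargs):
            if not first_chunk_seen:
                first_chunk_seen = True
                # Question 2: a span event instead of a dedicated time-to-first-token span.
                span.add_event(
                    "gen_ai.content.first_chunk",
                    attributes={"elapsed_seconds": time.monotonic() - start},
                )
            yield chunk
    except Exception as exc:
        span.record_exception(exc)
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(exc)))
        raise
    finally:
        # Ending the span here gives it the time-to-last-token duration and lets the
        # instrumentation attach usage/completion details once the stream is drained.
        span.end()
```

Ending the span in `finally` keeps errors and the full duration on the same span even when the consumer stops reading early.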
@karthikscale3
Contributor

Token generation latency is another metric that could be useful.

@TaoChenOSU
Contributor

time-to-first-token and time-to-next-token could be hard for some SDKs to capture, since a single chunk returned by some APIs may contain multiple tokens. Would time-to-first-response make more sense?

Another option would be to recommend that people indicate streaming vs. non-streaming in the operation name, such as streaming chat for streaming and chat for non-streaming.
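
If we went with the operation-name approach, it could look roughly like the sketch below; streaming chat is only this proposal, not an approved value of gen_ai.operation.name:

```python
from opentelemetry import trace

tracer = trace.get_tracer("example.genai.instrumentation")

def start_chat_span(model: str, streaming: bool) -> trace.Span:
    # Encode streaming vs. non-streaming directly in the operation name, per the
    # proposal above; "streaming chat" is not an approved convention value.
    operation = "streaming chat" if streaming else "chat"
    span = tracer.start_span(f"{operation} {model}")
    span.set_attribute("gen_ai.operation.name", operation)
    span.set_attribute("gen_ai.request.model", model)
    return span
```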

@lmolkova
Contributor Author

lmolkova commented Oct 9, 2024

> time-to-first-token and time-to-next-token could be hard for some SDKs to capture, since a single chunk returned by some APIs may contain multiple tokens. Would time-to-first-response make more sense?

Good catch! Maybe time-to-first-chunk and time-to-next-chunk?
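
A sketch of what chunk-level metrics could look like with the OpenTelemetry metrics API; the instrument names are only illustrative and not (yet) defined by the GenAI semantic conventions:

```python
import time

from opentelemetry import metrics

meter = metrics.get_meter("example.genai.instrumentation")

# Illustrative instrument names; not part of the current conventions.
time_to_first_chunk = meter.create_histogram(
    "gen_ai.client.time_to_first_chunk", unit="s",
    description="Time from the start of the request until the first streamed chunk")
time_to_next_chunk = meter.create_histogram(
    "gen_ai.client.time_to_next_chunk", unit="s",
    description="Time between consecutive streamed chunks")

def measure_chunk_latencies(chunks):
    """Record chunk-level (not token-level) latencies while passing chunks through."""
    start = last = time.monotonic()
    first = True
    for chunk in chunks:
        now = time.monotonic()
        if first:
            time_to_first_chunk.record(now - start)
            first = False
        else:
            time_to_next_chunk.record(now - last)
        last = now
        yield chunk
```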

lmolkova removed their assignment Oct 31, 2024