
fix(anthropic): add thinking as a separate completion message #2780

Merged
merged 2 commits into traceloop:main on Mar 24, 2025

Conversation

@dinmukhamedm (Contributor) commented Mar 15, 2025

Addresses #2774

This PR introduces many small changes:

  1. The feature itself (🚀 Feature: Add thinking tokens to the instrumentation #2774): record the thinking block as a separate completion message, i.e. set gen_ai.completion.0.role = "thinking" and put the thinking text in gen_ai.completion.0.content, for both streaming and non-streaming responses (see the sketch after this list).
  2. Updates the anthropic SDK dependency to support streaming in tests. As a result, a few follow-on changes:
    • Handle the case where cache_creation_input_tokens and cache_read_input_tokens are explicitly set to None. I think this was not the case in the older SDK, and it was breaking token usage on both spans and metrics.
    • count_tokens is no longer available in the SDK itself. There is an anthropic.beta.messages.count_tokens() now that accepts a messages array and a model name (starting from claude-2), but it is actually a call to the API – I don't think we need to impose the additional latency and network overhead.
  3. Tiny refactor around how cache tokens are counted.
  4. Split the test file and move cassettes accordingly – hence the huge diff.
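
A minimal sketch of the resulting span attributes, assuming the standard OpenTelemetry span API (the attribute names follow the description above; the exact helpers in __init__.py and the index layout are assumptions):

# Hypothetical illustration only -- not the actual instrumentation code.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Stand-in for the content blocks of an Anthropic response with extended
# thinking enabled: a "thinking" block followed by the visible "text" block.
content_blocks = [
    {"type": "thinking", "thinking": "Let me work through the question..."},
    {"type": "text", "text": "Here is the answer."},
]

with tracer.start_as_current_span("anthropic.chat") as span:
    for index, block in enumerate(content_blocks):
        if block["type"] == "thinking":
            span.set_attribute(f"gen_ai.completion.{index}.role", "thinking")
            span.set_attribute(f"gen_ai.completion.{index}.content", block["thinking"])
        elif block["type"] == "text":
            span.set_attribute(f"gen_ai.completion.{index}.role", "assistant")
            span.set_attribute(f"gen_ai.completion.{index}.content", block["text"])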
  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

Important

This PR adds 'thinking' as a separate completion message in the Anthropic integration, updates the SDK, and refactors tests to support these changes.

  • Behavior:
    • Adds 'thinking' as a separate completion message with gen_ai.completion.0.role = "thinking" in __init__.py.
    • Updates anthropic SDK dependency to support streaming.
    • Handles None values for cache_creation_input_tokens and cache_read_input_tokens in __init__.py.
    • Removes count_tokens usage due to SDK changes.
  • Tests:
    • Splits and refactors tests into test_messages.py, test_prompt_caching.py, and test_thinking.py.
    • Adds new test cases for 'thinking' in test_thinking.py.
    • Moves cassettes to new test files.
  • Misc:
    • Refactors token counting logic in __init__.py.
    • Updates verify_metrics in utils.py to handle new metrics.

This description was created by Ellipsis for 7b39ca0. It will automatically update as commits are pushed.


@ellipsis-dev ellipsis-dev bot left a comment

👍 Looks good to me! Reviewed everything up to 7b39ca0 in 2 minutes and 44 seconds

More details
  • Looked at 6172 lines of code in 23 files
  • Skipped 1 file when reviewing.
  • Skipped posting 10 drafted comments based on config settings.
1. packages/opentelemetry-instrumentation-anthropic/tests/utils.py:1
  • Draft comment:
    Good concise test utility. The asserts check required metrics attributes correctly.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
2. packages/opentelemetry-instrumentation-anthropic/tests/utils.py:25
  • Draft comment:
    Consider adding a brief comment to explain the purpose of 'ignore_zero_input_tokens' flag.
  • Reason this comment was not posted:
    Confidence changes required: 30% <= threshold 50%
    None
3. packages/opentelemetry-instrumentation-anthropic/tests/utils.py:25
  • Draft comment:
    Consider adding custom error messages to your assert statements for better debugging clarity. For example, when asserting that data_point.sum > 0, a message like 'Expected positive token sum for input tokens but got {data_point.sum}' would help.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50%
    None
4. packages/opentelemetry-instrumentation-anthropic/tests/utils.py:58
  • Draft comment:
    All metric data points are forced to have gen_ai.system == 'anthropic'. This assert is concise and correct, but if future extensions support other systems, consider parameterizing this check.
  • Reason this comment was not posted:
    Confidence changes required: 30% <= threshold 50%
    None
5. packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/__init__.py:229
  • Draft comment:
    Typo in comment: 'Antrhopic' should be corrected to 'Anthropic' for consistency.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
6. packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_message_streaming.yaml:267
  • Draft comment:
    Typo detected: The text segment "getting starte" is missing a 'd'. Please update it to "getting started".
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
7. packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_message_streaming.yaml:396
  • Draft comment:
    Typo detected: The streamed text fragments "a ligh" followed by "thearted introduction" seem to be intended as "a lighthearted introduction". Please check and correct the concatenation.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
8. packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_message_streaming.yaml:437
  • Draft comment:
    Typo detected: The contraction "you''" appears to be incorrect and should be "you'd". Additionally, the phrase that is split into "hear any other tech" and then "theme" and "d jokes!" seems intended to be "tech-themed jokes!". Please review the streaming text concatenation for proper word splits.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
9. packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_tools.yaml:38
  • Draft comment:
    Minor typographical note: the operating system value 'MacOS' is used, but the conventional spelling is 'macOS'. It's a trivial change for consistency.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
10. packages/opentelemetry-instrumentation-anthropic/tests/test_prompt_caching.py:79
  • Draft comment:
    Typo: The word 'cassete' should be corrected to 'cassette' in the comment. Please update it consistently across the file.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
    While technically correct that "cassete" is misspelled, this is just a typo in comments that doesn't affect functionality. Comments are documentation and should ideally be correct, but this is an extremely minor issue that doesn't impact understanding. The rules say not to make comments that are obvious or unimportant.
    The typo does appear consistently throughout the file, so fixing it would improve documentation quality. Comments are part of code maintenance.
    However, this is an extremely low-impact issue. The meaning is still clear despite the typo. This kind of nitpick creates noise in PR reviews without adding meaningful value.
    While technically correct, this comment about a minor typo in documentation is too trivial to be worth keeping in a PR review.

Workflow ID: wflow_nyEKVvqQmgvvdH2u


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@dinmukhamedm (Contributor, Author)

@nirga any thoughts/feedback?

@nirga nirga (Member) left a comment

sorry for the delay @dinmukhamedm! I looked at it on my phone and I was like "ok I need my laptop for this it's a huge PR" and then forgot about it 😨
looks great, just had a super small nit comment - can you fix?

@@ -52,8 +55,8 @@ def _set_token_usage(
     token_histogram: Histogram = None,
     choice_counter: Counter = None,
 ):
-    cache_read_tokens = complete_response.get("usage", {}).get("cache_read_input_tokens", 0)
-    cache_creation_tokens = complete_response.get("usage", {}).get("cache_creation_input_tokens", 0)
+    cache_read_tokens = complete_response.get("usage", {}).get("cache_read_input_tokens", 0) or 0
Member

nit: you don't need this since get returns 0 by default

@dinmukhamedm (Contributor, Author) commented Mar 24, 2025

No, the thing is that the new anthropic SDK explicitly sets these fields to None, and so get returns None, which fails below where we try to add things up.

>>> d = {"key": "val", "none_key": None}
>>> v = d.get("none_key", "fallback")
>>> print (v)
None
>>> 

same for all the changes below
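
A standalone illustration of the guard being added in this diff (assumes a usage dict shaped like the SDK response; not the actual instrumentation code):

# The default in .get(key, 0) only applies to missing keys, so an explicit
# None slips through and breaks the arithmetic below; `or 0` normalizes it.
usage = {"input_tokens": 10, "cache_read_input_tokens": None}

without_guard = usage.get("cache_read_input_tokens", 0)       # -> None
with_guard = usage.get("cache_read_input_tokens", 0) or 0     # -> 0

total = usage.get("input_tokens", 0) + with_guard              # 10, no TypeError
print(without_guard, with_guard, total)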

Member

Oh
I don't know if I hate python more or Anthropic more. Thanks!

-    cache_read_tokens = complete_response.get("usage", {}).get("cache_read_input_tokens", 0)
-    cache_creation_tokens = complete_response.get("usage", {}).get("cache_creation_input_tokens", 0)
+    cache_read_tokens = complete_response.get("usage", {}).get("cache_read_input_tokens", 0) or 0
+    cache_creation_tokens = complete_response.get("usage", {}).get("cache_creation_input_tokens", 0) or 0
Member

same

@@ -159,13 +164,13 @@ def build_from_streaming_response(
 completion_tokens = -1
 # prompt_usage
 if usage := complete_response.get("usage"):
-    prompt_tokens = usage.get("input_tokens", 0)
+    prompt_tokens = usage.get("input_tokens", 0) or 0
Member

same?

@dinmukhamedm (Contributor, Author)

No worries @nirga! Responded to the comment.

@nirga nirga merged commit e65915c into traceloop:main Mar 24, 2025
9 checks passed