
fix(anthropic): add thinking as a separate completion message #2780

Merged
merged 2 commits into traceloop:main on Mar 24, 2025

Conversation

@dinmukhamedm (Contributor) commented Mar 15, 2025

Addresses #2774

This PR introduces many small changes:

  1. The feature itself (🚀 Feature: Add thinking tokens to the instrumentation #2774): record the thinking block as a separate completion message, i.e. set gen_ai.completion.0.role = "thinking" and put the thinking text in gen_ai.completion.0.content, for both streaming and non-streaming responses (see the sketch after this list).
  2. Updates the anthropic SDK dependency to support streaming in tests. As a result, a few follow-on changes:
    • Handle the case where cache_creation_input_tokens and cache_read_input_tokens are explicitly set to None. I think this was not the case in the older SDK, and it was breaking token usage on both spans and metrics.
    • count_tokens is no longer available in the SDK itself. There is an anthropic.beta.messages.count_tokens() now that accepts a messages array and a model name (starting from claude-2), but it is actually a call to the API – I don't think we need to impose the additional latency and network overhead.
  3. Tiny refactor around how cache tokens are counted.
  4. Split the test file and move cassettes accordingly – hence the huge diff.
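
A minimal sketch of the resulting span attributes, assuming the standard OpenTelemetry span API (the attribute names follow the description above; the exact helpers in __init__.py and the index layout are assumptions):

# Hypothetical illustration only -- not the actual instrumentation code.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Stand-in for the content blocks of an Anthropic response with extended
# thinking enabled: a "thinking" block followed by the visible "text" block.
content_blocks = [
    {"type": "thinking", "thinking": "Let me work through the question..."},
    {"type": "text", "text": "Here is the answer."},
]

with tracer.start_as_current_span("anthropic.chat") as span:
    for index, block in enumerate(content_blocks):
        if block["type"] == "thinking":
            span.set_attribute(f"gen_ai.completion.{index}.role", "thinking")
            span.set_attribute(f"gen_ai.completion.{index}.content", block["thinking"])
        elif block["type"] == "text":
            span.set_attribute(f"gen_ai.completion.{index}.role", "assistant")
            span.set_attribute(f"gen_ai.completion.{index}.content", block["text"])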
  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

Important

This PR adds 'thinking' as a separate completion message in the Anthropic integration, updates the SDK, and refactors tests to support these changes.

  • Behavior:
    • Adds 'thinking' as a separate completion message with gen_ai.completion.0.role = "thinking" in __init__.py.
    • Updates anthropic SDK dependency to support streaming.
    • Handles None values for cache_creation_input_tokens and cache_read_input_tokens in __init__.py.
    • Removes count_tokens usage due to SDK changes.
  • Tests:
    • Splits and refactors tests into test_messages.py, test_prompt_caching.py, and test_thinking.py.
    • Adds new test cases for 'thinking' in test_thinking.py.
    • Moves cassettes to new test files.
  • Misc:
    • Refactors token counting logic in __init__.py.
    • Updates verify_metrics in utils.py to handle new metrics.

This description was created by Ellipsis for 7b39ca0. It will automatically update as commits are pushed.


@ellipsis-dev ellipsis-dev bot left a comment

👍 Looks good to me! Reviewed everything up to 7b39ca0 in 2 minutes and 44 seconds

More details
  • Looked at 6172 lines of code in 23 files
  • Skipped 1 file when reviewing.
  • Skipped posting 10 drafted comments based on config settings.
1. packages/opentelemetry-instrumentation-anthropic/tests/utils.py:1
  • Draft comment:
    Good concise test utility. The asserts check required metrics attributes correctly.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
2. packages/opentelemetry-instrumentation-anthropic/tests/utils.py:25
  • Draft comment:
    Consider adding a brief comment to explain the purpose of 'ignore_zero_input_tokens' flag.
  • Reason this comment was not posted:
    Confidence changes required: 30% <= threshold 50%
    None
3. packages/opentelemetry-instrumentation-anthropic/tests/utils.py:25
  • Draft comment:
    Consider adding custom error messages to your assert statements for better debugging clarity. For example, when asserting that data_point.sum > 0, a message like 'Expected positive token sum for input tokens but got {data_point.sum}' would help.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50%
    None
4. packages/opentelemetry-instrumentation-anthropic/tests/utils.py:58
  • Draft comment:
    All metric data points are forced to have gen_ai.system == 'anthropic'. This assert is concise and correct, but if future extensions support other systems, consider parameterizing this check.
  • Reason this comment was not posted:
    Confidence changes required: 30% <= threshold 50%
    None
5. packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/__init__.py:229
  • Draft comment:
    Typo in comment: 'Antrhopic' should be corrected to 'Anthropic' for consistency.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
6. packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_message_streaming.yaml:267
  • Draft comment:
    Typo detected: The text segment "getting starte" is missing a 'd'. Please update it to "getting started".
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
7. packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_message_streaming.yaml:396
  • Draft comment:
    Typo detected: The streamed text fragments "a ligh" followed by "thearted introduction" seem to be intended as "a lighthearted introduction". Please check and correct the concatenation.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
8. packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_message_streaming.yaml:437
  • Draft comment:
    Typo detected: The contraction "you''" appears to be incorrect and should be "you'd". Additionally, the phrase that is split into "hear any other tech" and then "theme" and "d jokes!" seems intended to be "tech-themed jokes!". Please review the streaming text concatenation for proper word splits.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
9. packages/opentelemetry-instrumentation-anthropic/tests/cassettes/test_messages/test_anthropic_tools.yaml:38
  • Draft comment:
    Minor typographical note: the operating system value 'MacOS' is used, but the conventional spelling is 'macOS'. It's a trivial change for consistency.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
10. packages/opentelemetry-instrumentation-anthropic/tests/test_prompt_caching.py:79
  • Draft comment:
    Typo: The word 'cassete' should be corrected to 'cassette' in the comment. Please update it consistently across the file.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
    While technically correct that "cassete" is misspelled, this is just a typo in comments that doesn't affect functionality. Comments are documentation and should ideally be correct, but this is an extremely minor issue that doesn't impact understanding. The rules say not to make comments that are obvious or unimportant.
    The typo does appear consistently throughout the file, so fixing it would improve documentation quality. Comments are part of code maintenance.
    However, this is an extremely low-impact issue. The meaning is still clear despite the typo. This kind of nitpick creates noise in PR reviews without adding meaningful value.
    While technically correct, this comment about a minor typo in documentation is too trivial to be worth keeping in a PR review.

Workflow ID: wflow_nyEKVvqQmgvvdH2u


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@dinmukhamedm (Contributor, Author)

@nirga any thoughts/feedback?

@nirga nirga (Member) left a comment

sorry for the delay @dinmukhamedm! I looked at it on my phone and I was like "ok I need my laptop for this it's a huge PR" and then forgot about it 😨
looks great, just had a super small nit comment - can you fix?

@@ -52,8 +55,8 @@ def _set_token_usage(
     token_histogram: Histogram = None,
     choice_counter: Counter = None,
 ):
-    cache_read_tokens = complete_response.get("usage", {}).get("cache_read_input_tokens", 0)
-    cache_creation_tokens = complete_response.get("usage", {}).get("cache_creation_input_tokens", 0)
+    cache_read_tokens = complete_response.get("usage", {}).get("cache_read_input_tokens", 0) or 0
Member

nit: you don't need this since get returns 0 by default

@dinmukhamedm (Contributor, Author) commented Mar 24, 2025

No, the thing is that the new anthropic SDK explicitly sets these fields to None, and so get returns None, which fails below where we try to add things up.

>>> d = {"key": "val", "none_key": None}
>>> v = d.get("none_key", "fallback")
>>> print (v)
None
>>> 

same for all the changes below
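
A standalone illustration of the guard being added in this diff (assumes a usage dict shaped like the SDK response; not the actual instrumentation code):

# The default in .get(key, 0) only applies to missing keys, so an explicit
# None slips through and breaks the arithmetic below; `or 0` normalizes it.
usage = {"input_tokens": 10, "cache_read_input_tokens": None}

without_guard = usage.get("cache_read_input_tokens", 0)       # -> None
with_guard = usage.get("cache_read_input_tokens", 0) or 0     # -> 0

total = usage.get("input_tokens", 0) + with_guard              # 10, no TypeError
print(without_guard, with_guard, total)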

Member

Oh
I don't know if I hate python more or Anthropic more. Thanks!

-    cache_read_tokens = complete_response.get("usage", {}).get("cache_read_input_tokens", 0)
-    cache_creation_tokens = complete_response.get("usage", {}).get("cache_creation_input_tokens", 0)
+    cache_read_tokens = complete_response.get("usage", {}).get("cache_read_input_tokens", 0) or 0
+    cache_creation_tokens = complete_response.get("usage", {}).get("cache_creation_input_tokens", 0) or 0
Member

same

@@ -159,13 +164,13 @@ def build_from_streaming_response(
 completion_tokens = -1
 # prompt_usage
 if usage := complete_response.get("usage"):
-    prompt_tokens = usage.get("input_tokens", 0)
+    prompt_tokens = usage.get("input_tokens", 0) or 0
Member

same?

@dinmukhamedm (Contributor, Author)

No worries @nirga! Responded to the comment.

@nirga nirga merged commit e65915c into traceloop:main Mar 24, 2025
9 checks passed