
Conversation

@jordan-wong commented Oct 2, 2025

What does this PR do?

Adds instrumentation for OpenAI's Responses API, including both APM span generation and LLM Observability (LLMObs) span generation.

Motivation

ML-Obs product priority

Plugin Checklist

Additional Notes


github-actions bot commented Oct 2, 2025

Overall package size

Self size: 12.71 MB
Deduped: 115.32 MB
No deduping: 117.53 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| @datadog/libdatadog | 0.7.0 | 35.02 MB | 35.02 MB |
| @datadog/native-appsec | 10.2.1 | 20.64 MB | 20.65 MB |
| @datadog/native-iast-taint-tracking | 4.0.0 | 11.72 MB | 11.73 MB |
| @datadog/pprof | 5.11.1 | 9.96 MB | 10.34 MB |
| @opentelemetry/core | 1.30.1 | 908.66 kB | 7.16 MB |
| protobufjs | 7.5.4 | 2.95 MB | 5.73 MB |
| @datadog/wasm-js-rewriter | 4.0.1 | 2.85 MB | 3.58 MB |
| @opentelemetry/resources | 1.9.1 | 306.54 kB | 1.74 MB |
| @datadog/native-metrics | 3.1.1 | 1.02 MB | 1.43 MB |
| @opentelemetry/api-logs | 0.205.0 | 201.51 kB | 1.42 MB |
| @opentelemetry/api | 1.9.0 | 1.22 MB | 1.22 MB |
| jsonpath-plus | 10.3.0 | 617.18 kB | 1.08 MB |
| import-in-the-middle | 1.14.4 | 123.18 kB | 851.76 kB |
| lru-cache | 10.4.3 | 804.3 kB | 804.3 kB |
| @datadog/openfeature-node-server | 0.1.0-preview.10 | 95.11 kB | 401.46 kB |
| opentracing | 0.14.7 | 194.81 kB | 194.81 kB |
| source-map | 0.7.6 | 185.63 kB | 185.63 kB |
| pprof-format | 2.2.1 | 163.06 kB | 163.06 kB |
| @datadog/sketches-js | 2.1.1 | 109.9 kB | 109.9 kB |
| lodash.sortby | 4.7.0 | 75.76 kB | 75.76 kB |
| ignore | 7.0.5 | 63.38 kB | 63.38 kB |
| istanbul-lib-coverage | 3.2.2 | 34.37 kB | 34.37 kB |
| rfdc | 1.4.1 | 27.15 kB | 27.15 kB |
| dc-polyfill | 0.1.10 | 26.73 kB | 26.73 kB |
| @isaacs/ttlcache | 1.4.1 | 25.2 kB | 25.2 kB |
| tlhunter-sorted-set | 0.1.0 | 24.94 kB | 24.94 kB |
| shell-quote | 1.8.3 | 23.74 kB | 23.74 kB |
| limiter | 1.1.5 | 23.17 kB | 23.17 kB |
| retry | 0.13.1 | 18.85 kB | 18.85 kB |
| semifies | 1.0.0 | 15.84 kB | 15.84 kB |
| jest-docblock | 29.7.0 | 8.99 kB | 12.76 kB |
| crypto-randomuuid | 1.0.0 | 11.18 kB | 11.18 kB |
| ttl-set | 1.0.0 | 4.61 kB | 9.69 kB |
| mutexify | 1.4.0 | 5.71 kB | 8.74 kB |
| path-to-regexp | 0.1.12 | 6.6 kB | 6.6 kB |
| module-details-from-path | 1.0.4 | 3.96 kB | 3.96 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe


codecov bot commented Oct 2, 2025

Codecov Report

❌ Patch coverage is 20.38835% with 82 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.80%. Comparing base (f82ae45) to head (65ae2e0).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| packages/dd-trace/src/llmobs/plugins/openai.js | 13.04% | 80 Missing ⚠️ |
| packages/dd-trace/src/llmobs/tagger.js | 77.77% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6583      +/-   ##
==========================================
+ Coverage   81.97%   83.80%   +1.82%     
==========================================
  Files         479      503      +24     
  Lines       20424    21094     +670     
==========================================
+ Hits        16742    17677     +935     
+ Misses       3682     3417     -265     

☔ View full report in Codecov by Sentry.


pr-commenter bot commented Oct 2, 2025

Benchmarks

Benchmark execution time: 2025-10-15 21:45:08

Comparing candidate commit 65ae2e0 in PR branch openai-responses-instrumentation with baseline commit f82ae45 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1596 metrics, 74 unstable metrics.

Comment on lines 25 to 31
{
file: 'resources/responses',
targetClass: 'Responses',
baseResource: 'responses',
methods: ['create'],
streamedResponse: false
},
Collaborator

Suggested change
{
file: 'resources/responses',
targetClass: 'Responses',
baseResource: 'responses',
methods: ['create'],
streamedResponse: false
},

baseResource: 'responses',
methods: ['create'],
streamedResponse: false,
versions: ['>=4.85.0']
Collaborator

Suggested change
versions: ['>=4.85.0']
versions: ['>=4.87.0']

targetClass: 'Responses',
baseResource: 'responses',
methods: ['create'],
streamedResponse: false,
Collaborator

Suggested change
streamedResponse: false,
streamedResponse: true,

Comment on lines 156 to 172
function wrapCreate (create) {
return function (request) {
if (!vertexaiTracingChannel.start.hasSubscribers) {
// calls the original function
return create.apply(this, arguments)
}

const ctx = {
request,
instance: this,
resource: [this.constructor.name, create.name].join('.')
}
// am I using the right channel? tracingChannel vs diagnostics channel
return ch.tracePromise(create, ctx, this, ...arguments)
}
}

Collaborator

Suggested change
function wrapCreate (create) {
return function (request) {
if (!vertexaiTracingChannel.start.hasSubscribers) {
// calls the original function
return create.apply(this, arguments)
}
const ctx = {
request,
instance: this,
resource: [this.constructor.name, create.name].join('.')
}
// am I using the right channel? tracingChannel vs diagnostics channel
return ch.tracePromise(create, ctx, this, ...arguments)
}
}

Comment on lines 196 to 205
//register patching hooks via addHook
addHook({ name: 'openai', file: 'resources/responses.js', versions: ['>=4.87.0'] }, exports => {
const Responses = exports.OpenAIApi.responses
// wrap functions on module exports with shimmer.wrap
shimmer.wrap(responses.prototype, 'responses.createResponse', wrapCreate)
return exports
})



Collaborator

Suggested change
//register patching hooks via addHook
addHook({ name: 'openai', file: 'resources/responses.js', versions: ['>=4.87.0'] }, exports => {
const Responses = exports.OpenAIApi.responses
// wrap functions on module exports with shimmer.wrap
shimmer.wrap(responses.prototype, 'responses.createResponse', wrapCreate)
return exports
})

@sabrenner left a comment (Collaborator)

really cool stuff, and i did see all the tests are passing!

i also realized that the tests are now a little out of date with what our python integration does, as it's been updated since writing the tests and the tests did not get those updates 😭 that's on me, so i'll update those tests and we might be able to simplify some logic here.

left a couple of other comments for things we might be able to clean up in the meantime, but i'll get those tests fixed up first 🫡

return finalResponse
}

// If no final response found, fall back to accumulating from deltas and items
Collaborator

is there a specific model/response format that doesn't include the aggregated streamed response in the last chunk? i think otherwise we can safely only look at the last chunk and not worry about doing manual aggregating

Comment on lines 398 to 415
// Extract input information
if (payload.input) {
openaiStore.input = payload.input
tags['openai.request.input_length'] = payload.input.length
}

// Extract reasoning configuration
if (payload.reasoning) {
if (payload.reasoning.effort) {
tags['openai.request.reasoning.effort'] = payload.reasoning.effort
}
openaiStore.reasoning = payload.reasoning
}

// Extract background flag
if (payload.background !== undefined) {
tags['openai.request.background'] = payload.background
}
Collaborator

i think we can remove these blocks and just do model tagging for apm spans
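For illustration, a minimal sketch of what model-only APM tagging could look like here, following the openai.request.* naming convention used in the blocks above (the exact tag key is an assumption, not necessarily the plugin's):

// hypothetical sketch: tag only the requested model on the APM span
if (payload.model) {
  tags['openai.request.model'] = payload.model
}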

@@ -191,6 +195,188 @@ class OpenAiLLMObsPlugin extends LLMObsPlugin {

this._tagger.tagMetadata(span, metadata)
}

_tagResponse (span, inputs, response, error) {
Collaborator

Suggested change
_tagResponse (span, inputs, response, error) {
#tagResponse (span, inputs, response, error) {

just as a general JS practice this is better for private functions (i know the other ones aren't like this, i need to get around to cleaning up the rest of this code 😅)

and then calling the function above would just be

this.#tagResponse(...)
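For reference, a minimal self-contained sketch of the ES2022 private-method syntax the suggestion relies on (the class shape and method bodies are illustrative, not the plugin's actual code):

class ExamplePlugin {
  // '#' makes the method truly private: it is only callable from inside the class
  #tagResponse (span, response) {
    span.tags = { ...span.tags, output: response }
  }

  setLLMObsTags (span, response) {
    // private methods are invoked with the same '#' prefix
    this.#tagResponse(span, response)
  }
}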

toolId: item.call_id,
name: item.name,
arguments: args,
type: 'function'
Collaborator

Suggested change
type: 'function'
type: item.type

Comment on lines 137 to 157
if (finalResponse) {
// For simple text responses, if output is empty or an empty array, accumulate from deltas
const outputIsEmpty = !finalResponse.output ||
finalResponse.output === '' ||
(Array.isArray(finalResponse.output) && finalResponse.output.length === 0)

if (outputIsEmpty) {
const outputText = chunks
.filter(chunk => chunk.type === 'response.output_text.delta')
.map(chunk => chunk.delta)
.join('')

if (outputText) {
return {
...finalResponse,
output: outputText
}
}
}
return finalResponse
}
Collaborator

Suggested change
if (finalResponse) {
// For simple text responses, if output is empty or an empty array, accumulate from deltas
const outputIsEmpty = !finalResponse.output ||
finalResponse.output === '' ||
(Array.isArray(finalResponse.output) && finalResponse.output.length === 0)
if (outputIsEmpty) {
const outputText = chunks
.filter(chunk => chunk.type === 'response.output_text.delta')
.map(chunk => chunk.delta)
.join('')
if (outputText) {
return {
...finalResponse,
output: outputText
}
}
}
return finalResponse
}
return finalResponse

i think this might be good here - were there any scenarios you found where the last chunk did not have aggregations done properly?

// - response.done/response.incomplete/response.completed: final response with output array and usage

// Find the last chunk with a complete response object (status: done, incomplete, or completed)
let finalResponse = null
Collaborator

Suggested change
let finalResponse = null
let finalResponse

Comment on lines 300 to 316
// Only include content if it's not empty OR if there are no tool calls
// (For responses API, tool call messages should not have content field)
if (content !== '' || !messageObj.tool_calls) {
messageObj.content = content
}

// For role, always include it (even if empty string) when there are tool calls
// Otherwise use conditional tagging which skips empty values
let condition
if (messageObj.tool_calls && messageObj.tool_calls.length > 0) {
// For tool call messages, always include role even if empty
messageObj.role = role || ''
condition = true
} else {
condition = this.#tagConditionalString(role, 'Message role', messageObj, 'role')
}

Collaborator

i've realized the shared tests are now a bit out of date with what we do in python - i will fix those and get back to you, i thinkkkkk this stuff here could be removed once i fix up the tests 🙏 i will let you know, but i did confirm that these changes work with the current tests, so thank you!

Comment on lines 349 to 376
// Tag metadata
const metadata = Object.entries(parameters).reduce((obj, [key, value]) => {
if (!['tools', 'functions', 'instructions'].includes(key)) {
obj[key] = value
}
return obj
}, {})

// Add fields from response
if (response.temperature !== undefined) metadata.temperature = Number(response.temperature)
if (response.top_p !== undefined) metadata.top_p = Number(response.top_p)
if (response.tools !== undefined) {
metadata.tools = Array.isArray(response.tools) ? [...response.tools] : response.tools
}
if (response.tool_choice !== undefined) metadata.tool_choice = response.tool_choice
if (response.truncation !== undefined) metadata.truncation = response.truncation
if (response.text !== undefined) metadata.text = response.text
if (response.usage?.output_tokens_details?.reasoning_tokens !== undefined) {
metadata.reasoning_tokens = response.usage.output_tokens_details.reasoning_tokens
}

// Add reasoning metadata from input parameters
if (reasoning) {
metadata.reasoning = reasoning
}
if (background !== undefined) {
metadata.background = background
}
Collaborator

i think we can keep metadata tagging to the request parameters and not from the response, and then maybe do a allowlist instead of denylist for what we want to include, ie

const metadata = Object.entries(parameters).reduce((obj, [key, value]) => {
      if (['temperature', 'top_p', 'truncation', ...].includes(key)) {
        obj[key] = value
      }
      return obj
    }, {})

lmk if that doesn't make sense!

Comment on lines +118 to +136
if (chunks.length === 0) return {}

// The responses API streams events with different types:
// - response.output_text.delta: incremental text deltas
// - response.output_text.done: complete text for a content part
// - response.output_item.done: complete output item with role
// - response.done/response.incomplete/response.completed: final response with output array and usage

// Find the last chunk with a complete response object (status: done, incomplete, or completed)
let finalResponse
for (let i = chunks.length - 1; i >= 0; i--) {
const chunk = chunks[i]
if (chunk.response && ['done', 'incomplete', 'completed'].includes(chunk.response.status)) {
finalResponse = chunk.response
return finalResponse
}
}

return finalResponse
Collaborator

Suggested change
if (chunks.length === 0) return {}
// The responses API streams events with different types:
// - response.output_text.delta: incremental text deltas
// - response.output_text.done: complete text for a content part
// - response.output_item.done: complete output item with role
// - response.done/response.incomplete/response.completed: final response with output array and usage
// Find the last chunk with a complete response object (status: done, incomplete, or completed)
let finalResponse
for (let i = chunks.length - 1; i >= 0; i--) {
const chunk = chunks[i]
if (chunk.response && ['done', 'incomplete', 'completed'].includes(chunk.response.status)) {
finalResponse = chunk.response
return finalResponse
}
}
return finalResponse
// The responses API streams events with different types:
// - response.output_text.delta: incremental text deltas
// - response.output_text.done: complete text for a content part
// - response.output_item.done: complete output item with role
// - response.done/response.incomplete/response.completed: final response with output array and usage
// Find the last chunk with a complete response object (status: done, incomplete, or completed)
for (let i = chunks.length - 1; i >= 0; i--) {
const chunk = chunks[i]
if (chunk.response && ['done', 'incomplete', 'completed'].includes(chunk.response.status)) {
return chunk.response
}
}

now that we simplified the logic a bit i think we can just return the proper chunk directly, and if there isn't one the function will return undefined by default

we could also move the ['done', 'incomplete', 'completed'] array to be a const at the top of the file
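Along those lines, a sketch of what the hoisted constant could look like (names are illustrative):

// hypothetical module-level constant for terminal response statuses
const FINAL_RESPONSE_STATUSES = ['done', 'incomplete', 'completed']

function findFinalResponse (chunks) {
  // walk backwards and return the first complete response object found
  for (let i = chunks.length - 1; i >= 0; i--) {
    const { response } = chunks[i]
    if (response && FINAL_RESPONSE_STATUSES.includes(response.status)) {
      return response
    }
  }
  // implicitly returns undefined when no final chunk exists
}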

} else if (tokenUsage.prompt_tokens_details) {
// Chat/Completions API - only include if > 0
const cacheReadTokens = tokenUsage.prompt_tokens_details.cached_tokens
if (cacheReadTokens !== undefined && cacheReadTokens > 0) {
Collaborator

Suggested change
if (cacheReadTokens !== undefined && cacheReadTokens > 0) {
if (cacheReadTokens) {

i think this should be the same, since if cacheReadTokens is 0 it'll still evaluate as falsy.
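A quick demonstration of the equivalence (values chosen for illustration):

for (const cacheReadTokens of [undefined, 0, 128]) {
  // undefined and 0 are both falsy, so one truthiness check covers both cases
  if (cacheReadTokens) console.log('tagging cache read tokens:', cacheReadTokens)
}
// only logs for 128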

if (typeof parsedArgs === 'string') {
try {
parsedArgs = JSON.parse(parsedArgs)
} catch (e) {
Collaborator

Suggested change
} catch (e) {
} catch {

if (item.type === 'reasoning') {
// Extract reasoning text from summary
let reasoningText = ''
if (item.summary && Array.isArray(item.summary) && item.summary.length > 0) {
Collaborator

Suggested change
if (item.summary && Array.isArray(item.summary) && item.summary.length > 0) {
if (Array.isArray(item.summary) && item.summary.length > 0) {

Array.isArray will also check if non-null/non-undefined!
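A quick demonstration:

Array.isArray(undefined) // false
Array.isArray(null)      // false
Array.isArray([])        // true, even when empty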

if (typeof args === 'string') {
try {
args = JSON.parse(args)
} catch (e) {
Collaborator

Suggested change
} catch (e) {
} catch {

}

// Extract tool calls if present in message.tool_calls
if (item.tool_calls && Array.isArray(item.tool_calls)) {
Collaborator

Suggested change
if (item.tool_calls && Array.isArray(item.tool_calls)) {
if (Array.isArray(item.tool_calls)) {

this._tagger.tagLLMIO(span, inputMessages, outputMessages)

// Tag metadata - use allowlist approach for request parameters
const allowedParamKeys = [
Collaborator

we can make this a const at the top of the file

}

// Include content if not empty, no tool calls/results, or explicitly provided
if (content !== '' || (!messageObj.tool_calls && !messageObj.tool_results) || ('content' in message)) {
Collaborator

i think we should be able to get rid of this changed logic here (although I know it'll fail the shared tests 😭). we actually get around this for the newly added anthropic integration by setting content to an empty string:

inputMessages.push({ content: '', role, toolResults: [toolResult] })

but i know the shared tests expect no content at all. let's revert this section, and let me see if i can modify the shared tests to account for either empty content/content not included, and we can undo this change, since this touches a path that users could run into as well.

const condition1 = this.#tagConditionalString(result, 'Tool result', toolResultObj, 'result')
const condition2 = this.#tagConditionalString(toolId, 'Tool ID', toolResultObj, 'tool_id')
// name can be empty string, so always include it
toolResultObj.name = name
Collaborator

Suggested change
toolResultObj.name = name
if (typeof name === 'string') {
toolResultObj.name = name
} else {
this.#handleFailure(`[LLMObs] Expected tool result name to be a string, instead got "${typeof name}"`)
}

since we can't use tagConditionalString here (which i should change in the future to only check on null/undefined and not falsy), we should add a small guard that the name is a string (which is what our backend expects)
