feat: add optional verbose metadata to /v1/infer endpoint #1305
Lifto wants to merge 1 commit into lightspeed-core:main
Conversation
Add a development/testing feature to return extended debugging metadata from the /v1/infer endpoint, similar to /v1/query. Requires dual opt-in: a config flag (allow_verbose_infer) and a request parameter (include_metadata). When enabled, returns tool_calls, tool_results, rag_chunks, referenced_documents, and token counts. Maintains backwards compatibility by excluding null fields from standard responses. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
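The dual opt-in described above can be sketched as a tiny gate function. The names `allow_verbose_infer` and `include_metadata` come from this PR; the function itself is illustrative, not the actual endpoint code:

```python
# Hypothetical sketch of the dual opt-in gate: verbose metadata is returned
# only when BOTH the server config flag and the per-request field are set.
def verbose_enabled(allow_verbose_infer: bool, include_metadata: bool) -> bool:
    """True only when server config and client request both opt in."""
    return allow_verbose_infer and include_metadata

# Only one of the four combinations enables verbose output:
assert verbose_enabled(True, True) is True
assert verbose_enabled(True, False) is False
assert verbose_enabled(False, True) is False
assert verbose_enabled(False, False) is False
```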
Walkthrough

This PR adds verbose metadata support to the /infer endpoint, allowing clients to request and receive detailed metadata including tool calls, tool results, RAG chunks, referenced documents, and token usage. A new configuration flag (`allow_verbose_infer`) gates the feature on the server side.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed (3 passed)
🧹 Nitpick comments (1)
src/app/endpoints/rlsapi_v1.py (1)
464-483: Consider extracting shared response creation logic. The verbose path duplicates the `client.responses.create()` call that exists in `retrieve_simple_response()`. Consider refactoring to avoid duplication.

♻️ Proposed refactor to reduce duplication
```diff
 async def retrieve_simple_response(
     question: str,
     instructions: str,
     tools: Optional[list[Any]] = None,
     model_id: Optional[str] = None,
-) -> str:
+    return_full_response: bool = False,
+) -> str | OpenAIResponseObject:
     """Retrieve a simple response from the LLM for a stateless query.
-
-    Uses the Responses API for simple stateless inference, consistent with
-    other endpoints (query, streaming_query).
-
-    Args:
-        question: The combined user input (question + context).
-        instructions: System instructions for the LLM.
-        tools: Optional list of MCP tool definitions for the LLM.
-        model_id: Fully qualified model identifier in provider/model format.
-            When omitted, the configured default model is used.
-
-    Returns:
-        The LLM-generated response text.
-
-    Raises:
-        APIConnectionError: If the Llama Stack service is unreachable.
-        HTTPException: 503 if no default model is configured.
+    ...
+    Args:
+        ...
+        return_full_response: If True, return the full OpenAIResponseObject.
+
+    Returns:
+        The LLM-generated response text, or full response object if requested.
     """
     client = AsyncLlamaStackClientHolder().get_client()
     resolved_model_id = model_id or await _get_default_model_id()
     logger.debug("Using model %s for rlsapi v1 inference", resolved_model_id)
     response = await client.responses.create(
         input=question,
         model=resolved_model_id,
         instructions=instructions,
         tools=tools or [],
         stream=False,
         store=False,
     )
     response = cast(OpenAIResponseObject, response)
-    extract_token_usage(response.usage, resolved_model_id)
-    return extract_text_from_response_items(response.output)
+    if return_full_response:
+        return response
+
+    extract_token_usage(response.usage, resolved_model_id)
+    return extract_text_from_response_items(response.output)
```

Then in `infer_endpoint`:

```diff
-    if verbose_enabled:
-        client = AsyncLlamaStackClientHolder().get_client()
-        response = await client.responses.create(
-            input=input_source,
-            model=model_id,
-            instructions=instructions,
-            tools=mcp_tools or [],
-            stream=False,
-            store=False,
-        )
-        response = cast(OpenAIResponseObject, response)
-        response_text = extract_text_from_response_items(response.output)
-    else:
-        response = None
-        response_text = await retrieve_simple_response(...)
+    if verbose_enabled:
+        response = await retrieve_simple_response(
+            input_source, instructions, tools=mcp_tools,
+            model_id=model_id, return_full_response=True
+        )
+        response = cast(OpenAIResponseObject, response)
+        response_text = extract_text_from_response_items(response.output)
+    else:
+        response = None
+        response_text = await retrieve_simple_response(
+            input_source, instructions, tools=mcp_tools, model_id=model_id
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/endpoints/rlsapi_v1.py` around lines 464 - 483, The verbose and non-verbose branches duplicate the client.responses.create call; refactor by extracting the shared response-creation logic into a single helper (for example a new function used by infer_endpoint and retrieve_simple_response) that calls AsyncLlamaStackClientHolder().get_client().responses.create with parameters (input_source, model_id, instructions, tools, stream, store) and returns the response object/text; then have infer_endpoint call that helper (or have retrieve_simple_response delegate to it) and remove the duplicated client.responses.create usage in the verbose branch so both paths share the same implementation.
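As a standalone sketch of the pattern this refactor proposes, the helper below selects between the extracted text and the full response object via a `return_full_response` flag. `FakeResponse` and `extract_text` are stand-ins for the real `OpenAIResponseObject` and `extract_text_from_response_items`; the actual code is async and calls `client.responses.create`:

```python
# Illustrative sketch: one helper owns the response creation, and a flag
# decides whether the caller gets plain text or the full response object.
from dataclasses import dataclass
from typing import Union

@dataclass
class FakeResponse:  # stand-in for OpenAIResponseObject
    output: list

def extract_text(output: list) -> str:
    """Stand-in for extract_text_from_response_items."""
    return " ".join(output)

def retrieve_simple_response(
    question: str, return_full_response: bool = False
) -> Union[str, FakeResponse]:
    # Stand-in for the single client.responses.create(...) call.
    response = FakeResponse(output=["echo:", question])
    if return_full_response:
        return response  # verbose path: caller extracts metadata itself
    return extract_text(response.output)  # standard path: text only

assert retrieve_simple_response("hi") == "echo: hi"
assert retrieve_simple_response("hi", return_full_response=True).output == ["echo:", "hi"]
```

Both endpoint branches can then share one implementation, which is the point of the review comment above.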
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 534d060a-0ac8-4214-8bda-9ec1b42ff84d
📒 Files selected for processing (4)
- src/app/endpoints/rlsapi_v1.py
- src/models/config.py
- src/models/rlsapi/requests.py
- src/models/rlsapi/responses.py
Summary
Add development/testing feature to return extended debugging metadata from the /v1/infer endpoint, similar to /v1/query. Requires dual opt-in: config flag (`allow_verbose_infer`) and request parameter (`include_metadata`). When enabled, returns:
- `tool_calls` - MCP tool calls made during inference
- `tool_results` - Results from MCP tool calls
- `rag_chunks` - RAG chunks retrieved from documentation
- `referenced_documents` - Source documents referenced
- `input_tokens` / `output_tokens` - Token usage

Maintains backwards compatibility by excluding null fields from standard responses.
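The backwards-compatibility rule above can be illustrated with a minimal stdlib sketch: the filter mimics what `response_model_exclude_none=True` does in FastAPI, so fields left as None never appear in the serialized body. The field names follow the PR description; the serializer itself is hypothetical:

```python
# Hypothetical serializer: drop None-valued fields so standard (non-verbose)
# responses keep their old shape, mimicking response_model_exclude_none=True.
import json

def serialize_infer_data(data: dict) -> str:
    return json.dumps({k: v for k, v in data.items() if v is not None})

standard = {"text": "hi", "request_id": "r1", "tool_calls": None, "input_tokens": None}
verbose = {"text": "hi", "request_id": "r1", "tool_calls": [], "input_tokens": 42}

# Standard response: only text and request_id survive.
print(serialize_infer_data(standard))  # {"text": "hi", "request_id": "r1"}
# Verbose response: the populated metadata fields are included.
print(serialize_infer_data(verbose))
```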
Changes
- `models/config.py` - Added `allow_verbose_infer` config flag with RBAC implementation notes
- `models/rlsapi/requests.py` - Added `include_metadata` request field
- `models/rlsapi/responses.py` - Extended `RlsapiV1InferData` with optional metadata fields
- `app/endpoints/rlsapi_v1.py` - Added conditional logic and `response_model_exclude_none=True`

Test Plan
`text` and `request_id`

🤖 Generated with Claude Code