
feat: add optional verbose metadata to /v1/infer endpoint#1305

Open
Lifto wants to merge 1 commit into lightspeed-core:main from Lifto:feat/verbose-infer-metadata

Conversation


@Lifto Lifto commented Mar 10, 2026

Summary

Add a development/testing feature that returns extended debugging metadata from the /v1/infer endpoint, similar to /v1/query. Enabling it requires dual opt-in: a config flag (allow_verbose_infer) and a request parameter (include_metadata).

When enabled, returns:

  • tool_calls - MCP tool calls made during inference
  • tool_results - Results from MCP tool calls
  • rag_chunks - RAG chunks retrieved from documentation
  • referenced_documents - Source documents referenced
  • input_tokens / output_tokens - Token usage

Maintains backwards compatibility by excluding null fields from standard responses.
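
The dual opt-in described above can be sketched as follows. The flag and field names come from this PR; the ServiceConfig and InferRequest shapes here are illustrative stand-ins for the real Pydantic models, not the actual implementation:

```python
from dataclasses import dataclass


@dataclass
class ServiceConfig:
    # Server-level opt-in; defaults to off so deployments never expose
    # debug metadata unless explicitly configured.
    allow_verbose_infer: bool = False


@dataclass
class InferRequest:
    query: str
    # Per-request opt-in; has no effect unless the server flag is also set.
    include_metadata: bool = False


def verbose_enabled(config: ServiceConfig, request: InferRequest) -> bool:
    """Metadata is returned only when BOTH opt-ins are present."""
    return config.allow_verbose_infer and request.include_metadata
```

Requiring both sides to opt in means a client cannot pull debug metadata out of a production server that has not enabled the flag.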

Changes

  • models/config.py - Added allow_verbose_infer config flag with RBAC implementation notes
  • models/rlsapi/requests.py - Added include_metadata request field
  • models/rlsapi/responses.py - Extended RlsapiV1InferData with optional metadata fields
  • app/endpoints/rlsapi_v1.py - Added conditional logic and response_model_exclude_none=True
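
A minimal sketch of the extended response model and the null-field exclusion. The field names are taken from the PR; the dataclass and the serialize helper only approximate what the real Pydantic model and FastAPI's response_model_exclude_none=True do:

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class RlsapiV1InferData:
    # Standard fields, always present.
    text: str
    request_id: str
    # Optional verbose-metadata fields; left as None unless verbose mode
    # is enabled for the request.
    tool_calls: Optional[list[dict[str, Any]]] = None
    tool_results: Optional[list[dict[str, Any]]] = None
    rag_chunks: Optional[list[dict[str, Any]]] = None
    referenced_documents: Optional[list[dict[str, Any]]] = None
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None


def serialize(data: RlsapiV1InferData) -> dict[str, Any]:
    # Approximates response_model_exclude_none=True: None-valued fields
    # are dropped, so standard responses keep their original shape.
    return {k: v for k, v in asdict(data).items() if v is not None}
```

This is how the change stays backwards compatible: a non-verbose response serializes to the same two-field payload it had before the new fields existed.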

Test Plan

  • Normal request returns only text and request_id
  • Verbose request with config disabled returns standard response
  • Verbose request with config enabled returns all metadata fields
  • Backwards compatibility maintained (no null fields in standard responses)
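
The test plan above reduces to a truth table over the two opt-ins. A self-contained sketch (flag and field names from the PR; the gating rule is restated here for illustration):

```python
def verbose_enabled(allow_verbose_infer: bool, include_metadata: bool) -> bool:
    # Restated gating rule: metadata only when both opt-ins are set.
    return allow_verbose_infer and include_metadata


# (config flag, request field) -> metadata expected in the response?
expected = {
    (False, False): False,  # normal request: only text and request_id
    (False, True):  False,  # verbose requested, config disabled: standard response
    (True,  False): False,  # config enabled, metadata not requested
    (True,  True):  True,   # dual opt-in: all metadata fields included
}
for (flag, field), want in expected.items():
    assert verbose_enabled(flag, field) is want
```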

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added optional verbose metadata support for inference requests. When enabled, responses now include detailed information about tool calls, tool results, RAG chunks, referenced documents, and token usage metrics.

Add development/testing feature to return extended debugging metadata
from /v1/infer endpoint, similar to /v1/query. Requires dual opt-in:
config flag (allow_verbose_infer) and request parameter (include_metadata).

When enabled, returns tool_calls, tool_results, rag_chunks,
referenced_documents, and token counts. Maintains backwards compatibility
by excluding null fields from standard responses.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

coderabbitai bot commented Mar 10, 2026

Walkthrough

This PR adds verbose metadata support to the /infer endpoint, allowing clients to request and receive detailed metadata including tool calls, tool results, RAG chunks, referenced documents, and token usage. A new configuration flag allow_verbose_infer and request field include_metadata control this behavior.

Changes

  • Verbose Inference Configuration (src/models/config.py): Added allow_verbose_infer: bool = False field to the Customization class to control verbose metadata availability at the server level.
  • API Contract Updates (src/models/rlsapi/requests.py, src/models/rlsapi/responses.py): Added include_metadata: bool field to RlsapiV1InferRequest and six new optional metadata fields (tool_calls, tool_results, rag_chunks, referenced_documents, input_tokens, output_tokens) to RlsapiV1InferData. Expanded imports to include the necessary response types.
  • Endpoint Implementation (src/app/endpoints/rlsapi_v1.py): Implemented conditional verbose-mode logic that fetches full response objects via the LlamaStack client, extracts metadata fields, and computes turn_summary when verbose mode is enabled. Updated the router decorator with response_model_exclude_none=True so null fields are omitted from serialized responses.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main change, adding optional verbose metadata to the /v1/infer endpoint, which is the primary focus of all modifications across four files.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.


@Lifto Lifto marked this pull request as draft March 10, 2026 20:55

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/app/endpoints/rlsapi_v1.py (1)

464-483: Consider extracting shared response creation logic.

The verbose path duplicates the client.responses.create() call that exists in retrieve_simple_response(). Consider refactoring to avoid duplication.

♻️ Proposed refactor to reduce duplication
 async def retrieve_simple_response(
     question: str,
     instructions: str,
     tools: Optional[list[Any]] = None,
     model_id: Optional[str] = None,
-) -> str:
+    return_full_response: bool = False,
+) -> str | OpenAIResponseObject:
     """Retrieve a simple response from the LLM for a stateless query.
-
-    Uses the Responses API for simple stateless inference, consistent with
-    other endpoints (query, streaming_query).
-
-    Args:
-        question: The combined user input (question + context).
-        instructions: System instructions for the LLM.
-        tools: Optional list of MCP tool definitions for the LLM.
-        model_id: Fully qualified model identifier in provider/model format.
-            When omitted, the configured default model is used.
-
-    Returns:
-        The LLM-generated response text.
-
-    Raises:
-        APIConnectionError: If the Llama Stack service is unreachable.
-        HTTPException: 503 if no default model is configured.
+    ...
+    Args:
+        ...
+        return_full_response: If True, return the full OpenAIResponseObject.
+
+    Returns:
+        The LLM-generated response text, or full response object if requested.
     """
     client = AsyncLlamaStackClientHolder().get_client()
     resolved_model_id = model_id or await _get_default_model_id()
     logger.debug("Using model %s for rlsapi v1 inference", resolved_model_id)

     response = await client.responses.create(
         input=question,
         model=resolved_model_id,
         instructions=instructions,
         tools=tools or [],
         stream=False,
         store=False,
     )
     response = cast(OpenAIResponseObject, response)
-    extract_token_usage(response.usage, resolved_model_id)
 
-    return extract_text_from_response_items(response.output)
+    if return_full_response:
+        return response
+
+    extract_token_usage(response.usage, resolved_model_id)
+    return extract_text_from_response_items(response.output)

Then in infer_endpoint:

-        if verbose_enabled:
-            client = AsyncLlamaStackClientHolder().get_client()
-            response = await client.responses.create(
-                input=input_source,
-                model=model_id,
-                instructions=instructions,
-                tools=mcp_tools or [],
-                stream=False,
-                store=False,
-            )
-            response = cast(OpenAIResponseObject, response)
-            response_text = extract_text_from_response_items(response.output)
-        else:
-            response = None
-            response_text = await retrieve_simple_response(...)
+        if verbose_enabled:
+            response = await retrieve_simple_response(
+                input_source, instructions, tools=mcp_tools,
+                model_id=model_id, return_full_response=True
+            )
+            response = cast(OpenAIResponseObject, response)
+            response_text = extract_text_from_response_items(response.output)
+        else:
+            response = None
+            response_text = await retrieve_simple_response(
+                input_source, instructions, tools=mcp_tools, model_id=model_id
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/rlsapi_v1.py` around lines 464 - 483, The verbose and
non-verbose branches duplicate the client.responses.create call; refactor by
extracting the shared response-creation logic into a single helper (for example
a new function used by infer_endpoint and retrieve_simple_response) that calls
AsyncLlamaStackClientHolder().get_client().responses.create with parameters
(input_source, model_id, instructions, tools, stream, store) and returns the
response object/text; then have infer_endpoint call that helper (or have
retrieve_simple_response delegate to it) and remove the duplicated
client.responses.create usage in the verbose branch so both paths share the same
implementation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 534d060a-0ac8-4214-8bda-9ec1b42ff84d

📥 Commits

Reviewing files that changed from the base of the PR and between de8a85a and 4c6b955.

📒 Files selected for processing (4)
  • src/app/endpoints/rlsapi_v1.py
  • src/models/config.py
  • src/models/rlsapi/requests.py
  • src/models/rlsapi/responses.py

@Lifto Lifto marked this pull request as ready for review March 11, 2026 15:55