
LCORE-1262: OpenResponses endpoint #1301

Open
asimurka wants to merge 2 commits into lightspeed-core:main from asimurka:conversation_management

Conversation

@asimurka asimurka commented Mar 10, 2026

Description

This PR introduces a fully functional POST /v1/responses API endpoint compatible with the OpenResponses specification. It integrates with existing LCORE capabilities including conversation tracking, shields, streaming support, RAG, MCP tools, topic summaries, and quota management.

The implementation also introduces implicit conversation management so that every response is associated with a conversation and can be inspected via the Conversations API.

Users can provide context to the LLM using one of the following mutually exclusive strategies:

  • conversation
  • previous_response_id
  • (no context)
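The three strategies above map to request payloads along these lines (an illustrative sketch only — field names follow the OpenResponses-style request described in this PR, the values are made up, and the exclusivity check below is a stand-in for the real request validator, not its implementation):

```python
# Illustrative request payloads for the three mutually exclusive context strategies.

continue_conversation = {
    "model": "gpt-4o-mini",
    "input": "What did we decide earlier?",
    "conversation": "c5260aec-4d82-4370-9fdf-05cf908b3f16",
}

branch_from_response = {
    "model": "gpt-4o-mini",
    "input": "Try a different approach instead.",
    "previous_response_id": "resp_abc123",
}

fresh_start = {
    "model": "gpt-4o-mini",
    "input": "Hello!",
}


def has_exclusive_context(payload: dict) -> bool:
    """The two context fields are mutually exclusive; at most one may be set."""
    return not ("conversation" in payload and "previous_response_id" in payload)


assert all(
    has_exclusive_context(p)
    for p in (continue_conversation, branch_from_response, fresh_start)
)
```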

In LCORE, a conversation is modeled as a linear chain of user turns (user query + LLM response), and every response belongs to exactly one conversation. Supporting previous_response_id as a context mechanism introduces branching semantics that would break this linearity if handled naively. To preserve a consistent conversation model, implicit conversation management applies the following rules:

  • If context is provided via conversation, the new response is appended to the existing conversation (provided validation and authorization checks pass).
  • If no context is provided, LCORE creates a new conversation and assigns the response to it.
  • If context is provided via previous_response_id, LCORE determines whether the referenced response is the latest response in its conversation:
    • If it is the latest response, the request is treated as a normal continuation of that conversation. The linear structure remains intact.
    • If it is not the latest response, the conversation is forked: a new conversation is created, and the current turn is added to it as its starting point.

Important: The conversation entity acts as a structural and inspection layer. If a user wants to provide the full context from a forked conversation, they can use previous_response_id, ensuring that the full historical context is preserved.

This design preserves strict linear conversations while enabling exploration of all generated responses.
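The rules above can be sketched as a small decision function (a hedged sketch, not the actual resolve_response_context implementation — the latest_response_id field and the lookup tables here are hypothetical stand-ins for the database state):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversation:
    id: str
    latest_response_id: Optional[str]  # hypothetical field name


def resolve_context(
    conversation_id: Optional[str],
    previous_response_id: Optional[str],
    conversations: dict[str, Conversation],
    response_to_conversation: dict[str, str],
) -> tuple[str, str]:
    """Return (action, conversation_id) per the implicit-management rules.

    action is one of: "append", "create", "fork".
    """
    if conversation_id is not None:
        # Explicit conversation: append to it (after validation/authz checks).
        return ("append", conversation_id)
    if previous_response_id is None:
        # No context: start a brand-new conversation.
        return ("create", "new-conversation-id")
    conv = conversations[response_to_conversation[previous_response_id]]
    if conv.latest_response_id == previous_response_id:
        # Referenced response is the tip of its chain: normal linear continuation.
        return ("append", conv.id)
    # Referenced response is mid-chain: fork into a new conversation.
    return ("fork", "new-conversation-id")


# Tiny worked example: conversation c1 holds responses r1 then r2 (r2 is the tip).
convs = {"c1": Conversation(id="c1", latest_response_id="r2")}
r2c = {"r1": "c1", "r2": "c1"}
assert resolve_context(None, "r2", convs, r2c) == ("append", "c1")  # tip → continue
assert resolve_context(None, "r1", convs, r2c)[0] == "fork"         # mid-chain → fork
```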

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Cursor

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Added new Responses endpoint with full streaming and non-streaming support.
    • Added support for safety identifiers and moderation-based response handling.
    • Added conversation branching via response-based forking.
    • Enhanced request parameters including max output tokens and reasoning fields.
    • Improved inline RAG context integration for response generation.
  • Documentation

    • Updated API documentation terminology and behavioral descriptions for clarity.

coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This pull request introduces a new /responses endpoint that enables streaming and non-streaming response generation with integrated moderation, context resolution, and conversation management. Changes span the endpoint implementation, database schema, API models, utility functions, and comprehensive test coverage.

Changes

Cohort / File(s) Summary
Responses Endpoint Implementation
src/app/endpoints/responses.py
New 731-line endpoint module providing responses_endpoint_handler entry point, streaming and non-streaming response paths, shield moderation handling, RAG context integration, quota tracking, and conversation persistence.
API Request/Response Models
src/models/requests.py, src/models/responses.py
Added new fields to ResponsesRequest (max_output_tokens, reasoning, safety_identifier, shield_ids); expanded ResponsesResponse with id, object, status, error, instructions, and reasoning fields; introduced echoed_params method and field validation.
Database Schema
src/models/database/conversations.py
Added last_response_id field to UserConversation and response_id field to UserTurn for tracking response-turn associations.
Conversation & Query Context Management
src/utils/conversations.py, src/utils/endpoints.py, src/utils/query.py
Added get_all_conversation_items for paginated item retrieval; introduced resolve_response_context for conversation/fork handling; added retrieve_turn_by_response_id and check_turn_existence helpers; updated persist_user_conversation_details to track response_id.
Response Processing & Utilities
src/utils/responses.py, src/utils/vector_search.py, src/utils/types.py
Updated build_turn_summary to accept OpenAIResponseObject and populate response id; added create_new_conversation, get_zero_usage, extract_attachments_text; introduced append_inline_rag_context_to_input; expanded ResponsesApiParams with max_output_tokens and reasoning fields; added ResponsesConversationContext type.
Safety & Moderation Integration
src/utils/shields.py, src/utils/suid.py
Minor adjustment to create_refusal_response formatting; added is_moderation_id utility function to identify moderation-blocked responses.
Query Endpoint Updates
src/app/endpoints/query.py, src/app/endpoints/streaming_query.py
Updated retrieve_response signature to accept ShieldModerationResult instead of individual moderation parameters; integrated moderation_id into turn summary; minimal streaming query adjustment for blocked moderation path.
Conversation Endpoint Refactor
src/app/endpoints/conversations_v1.py
Replaced direct database queries with utility functions get_all_conversation_items and retrieve_conversation_turns for item and turn retrieval.
Router Registration
src/app/routers.py
Registered new responses router under /v1 prefix.
Documentation Updates
docs/responses.md
Restructured sections (renamed "Inherited LLS OpenAPI Fields" to "Attributes"); added behavioral differences documentation; clarified LCORE-specific field exposure and streaming event details.
Endpoint & Utility Unit Tests
tests/unit/app/endpoints/test_responses.py, tests/unit/app/endpoints/test_conversations.py, tests/unit/app/test_routers.py, tests/unit/utils/test_*.py
Comprehensive test coverage for responses endpoint (1364 lines), updated conversation tests for new utility integration, router registration verification, and utility function behavior validation.
Integration & E2E Tests
tests/integration/test_openapi_json.py, tests/e2e/features/responses.feature, tests/e2e/test_list.txt
Added OpenAPI spec validation for /v1/responses endpoint; created end-to-end feature tests for streaming and non-streaming response scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ResponsesHandler as Responses<br/>Handler
    participant ConfigResolver as Config &<br/>Context Resolver
    participant ShieldMod as Shield<br/>Moderation
    participant RAGContext as RAG<br/>Context Builder
    participant LlamaStack as Llama Stack<br/>API
    participant Database as Database<br/>(Persistence)

    Client->>ResponsesHandler: POST /v1/responses
    ResponsesHandler->>ConfigResolver: validate config, resolve model
    ConfigResolver->>ConfigResolver: check model config
    ResponsesHandler->>ShieldMod: run_shield_moderation
    ShieldMod->>ShieldMod: evaluate safety rules
    alt Moderation Blocked
        ShieldMod-->>ResponsesHandler: ShieldModerationResult (blocked)
        ResponsesHandler->>Database: persist blocked turn
        ResponsesHandler-->>Client: Response with error status
    else Moderation Passed
        ShieldMod-->>ResponsesHandler: ShieldModerationResult (passed)
        ResponsesHandler->>RAGContext: build inline RAG context
        RAGContext-->>ResponsesHandler: context text
        ResponsesHandler->>LlamaStack: request response (streaming or non-streaming)
        alt Streaming Path
            loop Process Stream
                LlamaStack-->>ResponsesHandler: response chunks
                ResponsesHandler->>ResponsesHandler: track quotas, aggregate usage
                ResponsesHandler-->>Client: SSE chunks
            end
        else Non-Streaming Path
            LlamaStack-->>ResponsesHandler: full response
            ResponsesHandler->>ResponsesHandler: build response summary
        end
        ResponsesHandler->>Database: persist conversation turn
        ResponsesHandler-->>Client: ResponsesResponse (final)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • are-ces
🚥 Pre-merge checks | ✅ 3 passed

  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title 'LCORE-1262: OpenResponses endpoint' clearly describes the main feature added: a new OpenResponses endpoint implementation for the LCORE system, which aligns with the substantial changes introducing a POST /v1/responses API.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 96.36%, which meets the required threshold of 80.00%.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
src/models/responses.py (1)

986-993: ⚠️ Potential issue | 🟡 Minor

Example includes forked field not declared in model.

The ConversationResponse example at line 989 includes "forked": False, but the ConversationResponse class (lines 936-993) doesn't declare a forked field. This will cause the example to be inconsistent with the actual model schema.

Either add the field to the model or remove it from the example:

Option 1: Add field to model (if forked is part of the API)
 class ConversationResponse(AbstractSuccessfulResponse):
     """Model representing a response for retrieving a conversation.
 
     Attributes:
         conversation_id: The conversation ID (UUID).
         chat_history: The chat history as a list of conversation turns.
+        forked: Whether this conversation was forked from another.
     """
 
     conversation_id: str = Field(
         ...,
         description="Conversation ID (UUID)",
         examples=["c5260aec-4d82-4370-9fdf-05cf908b3f16"],
     )
 
     chat_history: list[ConversationTurn] = Field(
         ...,
         description="The simplified chat history as a list of conversation turns",
         # ... examples ...
     )
+
+    forked: bool = Field(
+        False,
+        description="Whether this conversation was forked from another",
+    )
Option 2: Remove from example (if not part of API)
                     "chat_history": [
                         {
                             "messages": [
                                 {"content": "Hello", "type": "user"},
                                 {"content": "Hi there!", "type": "assistant"},
                             ],
                             "tool_calls": [],
                             "tool_results": [],
                             "provider": "openai",
                             "model": "gpt-4o-mini",
                             "started_at": "2024-01-01T00:01:00Z",
                             "completed_at": "2024-01-01T00:01:05Z",
                         }
                     ],
-                    "forked": False,
                 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/responses.py` around lines 986 - 993, The example for
ConversationResponse contains a "forked" field that is not declared on the
ConversationResponse model; either add a boolean field named forked to the
ConversationResponse class (type bool, sensible default like False, update
schema/serialization and tests) if the API actually returns it, or remove
"forked" from the example JSON if the API does not include it—locate the
ConversationResponse class and the example block and apply the corresponding
change so model and example stay consistent.
tests/unit/app/endpoints/test_streaming_query.py (1)

334-366: ⚠️ Potential issue | 🟡 Minor

Assert that the moderation decision is passed into build_rag_context.

build_rag_context() gained a new positional parameter, but these tests replace it with a permissive AsyncMock and never verify the call arguments. A regression back to the old signature, or forwarding the wrong decision, would still pass here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/app/endpoints/test_streaming_query.py` around lines 334 - 366, The
test replaces build_rag_context with a permissive AsyncMock but never verifies
it received the moderation decision; update the test to capture and assert that
build_rag_context was called with the moderation result returned by
run_shield_moderation (e.g., the ShieldModerationPassed instance) as the new
positional argument. Specifically, keep the AsyncMock replacement for
app.endpoints.streaming_query.build_rag_context, await the endpoint invocation,
then assert mock_build_rag_context.assert_awaited() / assert_called_with(...)
(or inspect mock_build_rag_context.await_args) includes the moderation decision
object returned by the patched run_shield_moderation so the test fails if the
decision is not forwarded.
tests/unit/app/test_routers.py (1)

111-133: ⚠️ Potential issue | 🟡 Minor

Include rags.router in the presence assertions.

The expected count was bumped to 21, but this test still never checks that rags.router is actually registered. A missing RAG router could be masked by a duplicate registration elsewhere and this test would still pass.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/app/test_routers.py` around lines 111 - 133, The test is missing
an assertion for rags.router; update the presence checks in test_routers.py by
adding an assertion that rags.router is in app.get_routers() (alongside the
existing assertions like conversations_v2.router and responses.router) so the
RAG router is explicitly verified.
tests/unit/utils/test_vector_search.py (1)

452-500: ⚠️ Potential issue | 🟡 Minor

Add a test for the new blocked-moderation fast path.

These updates only exercise the "passed" path. The actual behavior change here is that build_rag_context() should return an empty context and skip vector lookups when moderation is blocked, so this file should cover that branch explicitly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/utils/test_vector_search.py` around lines 452 - 500, Add a new
async test in TestBuildRagContext that calls build_rag_context with moderation
status "blocked" (instead of "passed") and asserts the returned context has
empty context_text, rag_chunks, and referenced_documents; also verify the vector
client was not used by asserting client_mock.vector_io.query (or
client_mock.vector_io.query.assert_not_awaited()) was not awaited/called.
Reference build_rag_context and client_mock.vector_io.query to locate the
behavior under test.
src/app/endpoints/query.py (1)

198-221: ⚠️ Potential issue | 🟠 Major

Don’t generate topic summaries for refused turns.

retrieve_response() can now short-circuit with a blocked moderation result, but this handler still falls through to get_topic_summary() for new conversations. That creates a second model call for a turn that was intentionally refused, and can resend the blocked prompt text when the refusal was caused by the prompt itself.

Possible fix
-    if not user_conversation and query_request.generate_topic_summary:
+    if (
+        moderation_result.decision == "passed"
+        and not user_conversation
+        and query_request.generate_topic_summary
+    ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/query.py` around lines 198 - 221, The handler currently
calls get_topic_summary for new conversations even when retrieve_response
short-circuited due to moderation; update the logic to skip topic summary
generation whenever moderation_result.decision is not "passed" (or when it
indicates a refusal) by checking moderation_result before invoking
get_topic_summary (use the existing moderation_result variable returned from
retrieve_response); only call get_topic_summary when not user_conversation,
query_request.generate_topic_summary is true, and moderation_result.decision ==
"passed".
src/utils/shields.py (1)

137-164: ⚠️ Potential issue | 🟠 Major

Translate models.list() failures the same way as the other shield API calls.

get_shields_for_request() now normalizes client.shields.list() failures, but available_models = {model.id for model in await client.models.list()} can still raise straight out of this function. If Llama Stack is unavailable during model resolution, this path will bypass the structured 503/status handling you just added for moderation calls.

Based on learnings: Applies to src/**/*.py : Handle APIConnectionError from Llama Stack in API operations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/shields.py` around lines 137 - 164, The call to
client.models.list() that builds available_models can raise APIConnectionError
and currently bypasses your structured 503 handling; wrap the await
client.models.list() call (used to compute available_models in the shields
resolution path after get_shields_for_request) in a try/except that catches
APIConnectionError and converts it to a ServiceUnavailableResponse (same pattern
as the moderation create except block), then raise
HTTPException(**response.model_dump()) from the caught exception; reference
ServiceUnavailableResponse, client.models.list(), available_models, and
get_shields_for_request when locating where to add the try/except.
🧹 Nitpick comments (6)
tests/unit/utils/test_query.py (1)

410-411: Tests updated correctly for new parameter.

The test calls now include the required response_id parameter. The comment at line 507 accurately reflects that UserTurn now includes response_id.

Consider adding test cases for the moderation ID special case where last_response_id should not be updated when response_id starts with "modr".

🧪 Suggested additional test case for moderation ID handling
def test_persist_does_not_update_last_response_id_for_moderation(
    self, mocker: MockerFixture
) -> None:
    """Test that moderation IDs don't update last_response_id."""
    existing_conv = UserConversation(
        id="conv1",
        user_id="user1",
        last_used_model="old_model",
        last_used_provider="old_provider",
        message_count=5,
        last_response_id="resp_previous",
    )
    mock_session = mocker.Mock()
    mock_conv_query = mocker.Mock()
    mock_conv_query.filter_by.return_value.first.return_value = existing_conv
    # ... setup similar to test_update_existing_conversation ...

    persist_user_conversation_details(
        user_id="user1",
        conversation_id="conv1",
        started_at="2024-01-01T00:00:00Z",
        completed_at="2024-01-01T00:00:05Z",
        model_id="new_model",
        provider_id="new_provider",
        topic_summary=None,
        response_id="modr_blocked_123",  # Moderation ID
    )

    # last_response_id should remain unchanged
    assert existing_conv.last_response_id == "resp_previous"

Also applies to: 458-459, 502-507

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/utils/test_query.py` around lines 410 - 411, Add a unit test in
tests/unit/utils/test_query.py that verifies persist_user_conversation_details
does not update UserConversation.last_response_id when the provided response_id
begins with the moderation prefix (e.g., "modr"); locate the test pattern used
in test_update_existing_conversation to mock an existing UserConversation (id,
user_id, last_response_id set to a sentinel), call
persist_user_conversation_details with response_id="modr_..." (or
"modr_blocked_123"), and assert the existing_conv.last_response_id remains the
original value; reference the function persist_user_conversation_details, the
UserConversation model, and the last_response_id field when adding this new test
case.
src/utils/query.py (1)

40-47: Consider consolidating imports from the same module.

The is_moderation_id import (line 47) could be combined with the existing normalize_conversation_id import (line 40) from the same module.

♻️ Consolidate imports
-from utils.suid import normalize_conversation_id
+from utils.suid import is_moderation_id, normalize_conversation_id
 from utils.token_counter import TokenCounter
 from utils.transcripts import (
     create_transcript,
     create_transcript_metadata,
     store_transcript,
 )
-from utils.suid import is_moderation_id
 from utils.types import TurnSummary
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/query.py` around lines 40 - 47, The imports from utils.suid are
duplicated; consolidate them by importing both normalize_conversation_id and
is_moderation_id in a single from utils.suid import ... statement (replace the
separate normalize_conversation_id and is_moderation_id import lines with one
combined import) to remove the redundant import and keep imports tidy.
src/utils/suid.py (1)

108-113: Consider handling non-string input for consistency.

The existing check_suid function (lines 39-40) validates that input is a string before processing. This new function will raise AttributeError if passed None. While callers currently pass valid strings, adding a guard would be consistent with the module's defensive style.

🛡️ Optional: Add type guard for consistency
 def is_moderation_id(suid: str) -> bool:
     """Check if given string is a moderation ID.
 
     Returns True if the string starts with 'modr'.
     """
+    if not isinstance(suid, str):
+        return False
     return suid.startswith("modr")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/suid.py` around lines 108 - 113, The is_moderation_id(suid: str)
function can raise AttributeError on non-string inputs; update it to guard
against non-strings (consistent with check_suid) by returning False when suid is
not a str (or by calling the existing check_suid(suid) first) and only then
checking startswith("modr"), so callers receiving None or other types won't
crash.
tests/e2e/features/responses.feature (1)

8-24: Assert the response contract, not just the status code.

Both scenarios currently pass even if /v1/responses returns the wrong payload shape, and the "stream": true case also passes if the endpoint stops streaming but still replies with HTTP 200. Please add at least one body assertion for the non-streaming path and a streaming-specific assertion (headers and/or event chunks) for the streaming path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/features/responses.feature` around lines 8 - 24, The tests only
assert HTTP 200; update the feature assertions so the non-streaming scenario
validates the response body shape (e.g., presence of output text/choices or a
top-level "output" or "generation" field) and the streaming scenario asserts
streaming behavior (e.g., Transfer-Encoding/Content-Type headers for SSE/chunked
and at least one event/chunk containing a partial generation). Modify the step
implementations used by "When I use \"responses\" to ask question with
authorization header" and the "Then The status code of the response is 200"
verification to add a body contract check for the non-streaming case and a
header + chunk/event assertion for the streaming case so the tests fail if the
payload shape or streaming behavior changes.
src/models/requests.py (1)

739-745: Use the shared moderation-ID helper here.

Hardcoding the "modr" prefix duplicates ID classification logic inside utils.suid. Calling the shared helper keeps this validator aligned if the moderation-ID format changes later.

Suggested fix
     @field_validator("previous_response_id")
     @classmethod
     def check_previous_response_id(cls, value: Optional[str]) -> Optional[str]:
         """Validate that previous_response_id does not start with 'modr'."""
-        if value is not None and value.startswith("modr"):
+        if value is not None and suid.is_moderation_id(value):
             raise ValueError("You cannot provide context by moderation response.")
         return value
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/requests.py` around lines 739 - 745, Replace the hardcoded prefix
check in the `@field_validator` for previous_response_id
(check_previous_response_id) with the shared moderation-ID helper from
utils.suid: call the helper (e.g., utils.suid.is_moderation_suid(value) or the
equivalent function provided) instead of value.startswith("modr"), raise the
same ValueError when the helper returns true, and add the necessary import for
that helper at the top of the file.
src/app/endpoints/responses.py (1)

624-624: Remove unused assignment.

completed_at assigned here is always overwritten at line 697, making this a dead assignment.

🧹 Proposed fix
     if moderation_result.decision == "blocked":
-        completed_at = datetime.now(UTC)
         output_text = moderation_result.message
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/responses.py` at line 624, Remove the dead assignment to
completed_at in src/app/endpoints/responses.py: the early "completed_at =
datetime.now(UTC)" is always overwritten later (the other completed_at
assignment near where the response is finalized), so delete the initial
assignment to avoid the unused variable; keep the later, correct completed_at
assignment and any uses that rely on it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/responses.md`:
- Around line 522-524: The docs currently state that providing shield_ids as an
empty list returns 422, but the implementation treats an empty list as "run no
shields"; update the documentation for the `shield_ids` request parameter to
reflect that behavior (i.e., if omitted, all configured shields are used; if
provided and empty, no shields are run) and ensure `generate_topic_summary` and
`solr` descriptions remain accurate; reference the `shield_ids` parameter name
to locate the doc string to change.
- Line 90: The documentation is inconsistent: the request/response tables
declare the reasoning field as a string while the dedicated "reasoning" section
defines it as an object with an effort property; decide on a single canonical
shape and make all mentions match it — either change the tables' `reasoning`
type to `object` (with a short inline note like `{ effort: string }`) and update
examples/schemas to show `{ "reasoning": { "effort": "low|medium|high" } }`, or
simplify the dedicated section to document `reasoning` as a `string` and remove
the object/effort example; update all occurrences of `reasoning` (including the
detailed section and examples) so the symbol `reasoning` has the same type
everywhere.

In `@src/app/endpoints/query.py`:
- Around line 298-302: The current code logs the full response payload at error
level (see the response variable cast to OpenAIResponseObject and the call to
response.model_dump()), which misclassifies normal requests as errors; change
the log to logger.debug and log only minimal identifiers (e.g., response id,
status, or request id) instead of the entire model_dump, and remove or downgrade
the logger.error("Response: %s", response.model_dump()) call so only concise
diagnostic IDs are emitted at debug level.

In `@src/app/endpoints/responses.py`:
- Line 535: The comment contains a typo: replace "Explicitely append the turn to
conversation if context passed by previous response" with "Explicitly append the
turn to conversation if context passed by previous response" (update the comment
string wherever "Explicitely" appears, e.g., the comment near the logic that
appends a turn to the conversation/context in responses.py).
- Around line 558-566: The docstring Args block lists parameters that don't
match the function signature: remove the nonexistent parameter
moderation_result, replace the incorrect user_id entry with the actual auth
parameter, and ensure the Args list and short descriptions match the real
parameters (generator, turn_summary, client which is an AsyncLlamaStackClient,
auth, input_text, started_at, api_params which is a ResponsesApiParams); update
each description to correspond to those symbols so the docstring accurately
reflects the function signature.
- Around line 609-616: Update the Args section of the function docstring where
it currently lists api_params and user_id: replace api_params with request and
user_id with auth so the docstring matches the actual function signature;
specifically edit the docstring entries that mention "api_params" and "user_id"
in the responses endpoint (the Args block in the function around the
AsyncLlamaStackClient usage) to read "request" and "auth" respectively, and
ensure the descriptions still correspond to the request object and authenticated
auth parameter.
- Around line 269-271: The docstring in the function inside
src/app/endpoints/responses.py contains stale Args entries for
forked_from_response_id and prev_valid_response_id that are not present in the
function signature; update the function's docstring (remove those two parameters
or replace them with the actual parameters used by the function) so the Args
section matches the real signature (search for the function definition in
responses.py and edit its docstring accordingly).
- Line 655: Update the inline comment that currently reads "# Explicitely append
the turn to conversation if context passed by previous response" to correct the
typo so it reads "# Explicitly append the turn to conversation if context passed
by previous response"; locate and change this comment in the responses.py area
where the conversation turn is appended (search for the existing comment text
near the code that appends the turn to the conversation).

In `@src/app/endpoints/streaming_query.py`:
- Around line 303-317: The blocked-moderation branch returns a normal generator
(shield_violation_generator) so generate_response() treats it as a successful
completed stream and later calls get_topic_summary(), causing an extra model
call; change the behavior in the branch where context.moderation_result.decision
== "blocked" (and where shield_violation_generator is returned) to instead
signal a non-success stream to the caller—for example by raising or returning
the same error/exception path generate_response() uses for failures or by
setting the stream result flag that generate_response() checks—so that
generate_response() does not execute post-stream success logic (including
get_topic_summary()); update code paths around generate_response(),
shield_violation_generator, and any stream success flag or exception used to
mark failed/blocked streams accordingly.
- Around line 259-264: The merge of inline RAG docs happens too early: because
retrieve_response_generator() returns a lazy generator,
turn_summary.referenced_documents is still empty when you call
deduplicate_referenced_documents with inline_rag_context.referenced_documents,
so those inline docs get overwritten later by response_generator(); move the
merge to after the generator has been consumed (e.g., after response_generator()
completes or inside the response.completed handling) so you combine
inline_rag_context.referenced_documents with the final
turn_summary.referenced_documents and then call
deduplicate_referenced_documents; update the logic around
retrieve_response_generator(), response_generator(), and response.completed to
perform the merge at that later point.

In `@src/models/requests.py`:
- Around line 694-695: The ResponsesRequest Pydantic model currently sets
model_config = {"extra": "ignore"}, which silently drops unknown request fields;
change this to {"extra": "forbid"} on the ResponsesRequest class so Pydantic
returns a 422 for unexpected keys, update any tests or callers that relied on
ignored extras, and run the test suite to ensure no callers depend on the
previous ignore behavior.
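The difference between the two `extra` modes can be sketched with a minimal stand-in model (the real `ResponsesRequest` has many more fields; `input` here is illustrative):

```python
from pydantic import BaseModel, ValidationError

class ResponsesRequest(BaseModel):
    """Minimal stand-in for the real model; only the config line matters here."""
    model_config = {"extra": "forbid"}  # was {"extra": "ignore"}
    input: str

# A known field still validates normally.
ok = ResponsesRequest(input="hello")

# An unknown field now raises ValidationError, which FastAPI surfaces as a 422.
try:
    ResponsesRequest(input="hello", unexpected="x")
    rejected = False
except ValidationError:
    rejected = True
```

With `"ignore"` the second construction would silently succeed and drop `unexpected`; with `"forbid"` the client gets an explicit validation error.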

In `@src/utils/conversations.py`:
- Around line 497-507: Wrap the Llama Stack pagination in
get_all_conversation_items with a try/except that catches APIConnectionError and
APIStatusError around both the initial await of paginator (first_page) and the
subsequent page.get_next_page() loop; when caught, log the error and convert it
to the project’s standard HTTP response (e.g., raise FastAPI HTTPException with
an appropriate 5xx status and a concise message including the original error),
and ensure the required exception classes are imported so raw client exceptions
are not leaked from paginator/first_page/page.get_next_page.
- Around line 473-481: The code in append_turn_items_to_conversation swallows
upstream APIStatusError and always returns a generic 500; instead, preserve the
upstream status and message: in the APIStatusError except block (alongside the
existing APIConnectionError handling), read the status code from the exception
(e.g., e.status_code or e.response.status_code) and the error text/details
(e.g., str(e) or e.response.text) and raise HTTPException(status_code=<upstream
status>, detail=<upstream detail>) from e rather than using
InternalServerErrorResponse.generic(); keep the APIConnectionError branch as-is.
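A self-contained sketch of the intended error translation, using stand-in exception classes so the pattern is runnable on its own (the real code would catch `llama_stack_client.APIStatusError` and raise FastAPI's `HTTPException`):

```python
class APIStatusError(Exception):
    """Stand-in for the client exception; carries the upstream status."""
    def __init__(self, status_code: int, message: str):
        super().__init__(message)
        self.status_code = status_code
        self.message = message

class HTTPException(Exception):
    """Stand-in for fastapi.HTTPException."""
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

def append_turn_items(create):
    try:
        return create()
    except APIStatusError as e:
        # Forward the upstream status/detail instead of a generic 500,
        # so a 404/403 from Llama Stack stays a 404/403 for the caller.
        raise HTTPException(status_code=e.status_code, detail=e.message) from e

def failing_create():
    raise APIStatusError(404, "conversation not found")

try:
    append_turn_items(failing_create)
except HTTPException as exc:
    status, detail = exc.status_code, exc.detail
```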

In `@src/utils/endpoints.py`:
- Around line 305-325: Rename the misspelled function check_turn_existance to
check_turn_existence (and update its docstring) and update every call/reference
to that symbol (e.g., places that invoke check_turn_existance) to the new name;
also update the logger message string that contains "existance" to "existence"
and ensure any exported API or tests that reference the old name are adjusted
accordingly (search for check_turn_existance to find all locations to change).

In `@tests/unit/app/endpoints/test_responses.py`:
- Around line 543-555: Remove the test-side workaround that patches
ResponsesResponse.model_validate (the _validate_with_model helper and the
mocker.patch.object call) so the test exercises the real validation; instead
call ResponsesResponse.model_validate on the handler's produced payload and
assert the payload includes a real "model" value (and otherwise let validation
fail), ensuring the test fails until the handler itself sets the model field.
Keep references to ResponsesResponse and model_validate when updating the test
and delete the _validate_with_model/mocker.patch.object usage.

In `@tests/unit/app/endpoints/test_streaming_query.py`:
- Around line 864-877: The test does not verify that the blocked path actually
persists the refusal turn; after calling retrieve_response_generator ensure the
patched append_turn_items_to_conversation AsyncMock was awaited with the refusal
data from mock_context.moderation_result.refusal_response (and the expected
conversation/context identifiers) so the blocked-flow persistence is exercised.
Locate the AsyncMock patch for
app.endpoints.streaming_query.append_turn_items_to_conversation and add an
assertion like: append_turn_items_to_conversation.assert_awaited_once_with(...
refusal_response ..., mock_context or conversation id) referencing
retrieve_response_generator, mock_context.moderation_result.refusal_response,
and append_turn_items_to_conversation.

---

Outside diff comments:
In `@src/app/endpoints/query.py`:
- Around line 198-221: The handler currently calls get_topic_summary for new
conversations even when retrieve_response short-circuited due to moderation;
update the logic to skip topic summary generation whenever
moderation_result.decision is not "passed" (or when it indicates a refusal) by
checking moderation_result before invoking get_topic_summary (use the existing
moderation_result variable returned from retrieve_response); only call
get_topic_summary when not user_conversation,
query_request.generate_topic_summary is true, and moderation_result.decision ==
"passed".

In `@src/models/responses.py`:
- Around line 986-993: The example for ConversationResponse contains a "forked"
field that is not declared on the ConversationResponse model; either add a
boolean field named forked to the ConversationResponse class (type bool,
sensible default like False, update schema/serialization and tests) if the API
actually returns it, or remove "forked" from the example JSON if the API does
not include it—locate the ConversationResponse class and the example block and
apply the corresponding change so model and example stay consistent.

In `@src/utils/shields.py`:
- Around line 137-164: The call to client.models.list() that builds
available_models can raise APIConnectionError and currently bypasses your
structured 503 handling; wrap the await client.models.list() call (used to
compute available_models in the shields resolution path after
get_shields_for_request) in a try/except that catches APIConnectionError and
converts it to a ServiceUnavailableResponse (same pattern as the moderation
create except block), then raise HTTPException(**response.model_dump()) from the
caught exception; reference ServiceUnavailableResponse, client.models.list(),
available_models, and get_shields_for_request when locating where to add the
try/except.

In `@tests/unit/app/endpoints/test_streaming_query.py`:
- Around line 334-366: The test replaces build_rag_context with a permissive
AsyncMock but never verifies it received the moderation decision; update the
test to capture and assert that build_rag_context was called with the moderation
result returned by run_shield_moderation (e.g., the ShieldModerationPassed
instance) as the new positional argument. Specifically, keep the AsyncMock
replacement for app.endpoints.streaming_query.build_rag_context, await the
endpoint invocation, then assert mock_build_rag_context.assert_awaited() /
assert_called_with(...) (or inspect mock_build_rag_context.await_args) includes
the moderation decision object returned by the patched run_shield_moderation so
the test fails if the decision is not forwarded.

In `@tests/unit/app/test_routers.py`:
- Around line 111-133: The test is missing an assertion for rags.router; update
the presence checks in test_routers.py by adding an assertion that rags.router
is in app.get_routers() (alongside the existing assertions like
conversations_v2.router and responses.router) so the RAG router is explicitly
verified.

In `@tests/unit/utils/test_vector_search.py`:
- Around line 452-500: Add a new async test in TestBuildRagContext that calls
build_rag_context with moderation status "blocked" (instead of "passed") and
asserts the returned context has empty context_text, rag_chunks, and
referenced_documents; also verify the vector client was not used by asserting
client_mock.vector_io.query (or
client_mock.vector_io.query.assert_not_awaited()) was not awaited/called.
Reference build_rag_context and client_mock.vector_io.query to locate the
behavior under test.
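A sketch of such a test, with a hypothetical simplification of `build_rag_context` inlined so the assertion pattern is runnable on its own:

```python
import asyncio
from unittest.mock import AsyncMock

async def build_rag_context(client, decision):
    # Hypothetical simplification of the real function: blocked prompts
    # return an empty context without ever touching the vector store.
    if decision != "passed":
        return {"context_text": "", "rag_chunks": [], "referenced_documents": []}
    await client.vector_io.query()
    return {"context_text": "...", "rag_chunks": ["..."], "referenced_documents": ["..."]}

client_mock = AsyncMock()
ctx = asyncio.run(build_rag_context(client_mock, "blocked"))

# The context is empty and the vector client was never awaited.
client_mock.vector_io.query.assert_not_awaited()
```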

---

Nitpick comments:
In `@src/app/endpoints/responses.py`:
- Line 624: Remove the dead assignment to completed_at in
src/app/endpoints/responses.py: the early "completed_at = datetime.now(UTC)" is
always overwritten later (the other completed_at assignment near where the
response is finalized), so delete the initial assignment to avoid the unused
variable; keep the later, correct completed_at assignment and any uses that rely
on it.

In `@src/models/requests.py`:
- Around line 739-745: Replace the hardcoded prefix check in the
`@field_validator` for previous_response_id (check_previous_response_id) with the
shared moderation-ID helper from utils.suid: call the helper (e.g.,
utils.suid.is_moderation_suid(value) or the equivalent function provided)
instead of value.startswith("modr"), raise the same ValueError when the helper
returns true, and add the necessary import for that helper at the top of the
file.

In `@src/utils/query.py`:
- Around line 40-47: The imports from utils.suid are duplicated; consolidate
them by importing both normalize_conversation_id and is_moderation_id in a
single from utils.suid import ... statement (replace the separate
normalize_conversation_id and is_moderation_id import lines with one combined
import) to remove the redundant import and keep imports tidy.

In `@src/utils/suid.py`:
- Around line 108-113: The is_moderation_id(suid: str) function can raise
AttributeError on non-string inputs; update it to guard against non-strings
(consistent with check_suid) by returning False when suid is not a str (or by
calling the existing check_suid(suid) first) and only then checking
startswith("modr"), so callers receiving None or other types won't crash.

In `@tests/e2e/features/responses.feature`:
- Around line 8-24: The tests only assert HTTP 200; update the feature
assertions so the non-streaming scenario validates the response body shape
(e.g., presence of output text/choices or a top-level "output" or "generation"
field) and the streaming scenario asserts streaming behavior (e.g.,
Transfer-Encoding/Content-Type headers for SSE/chunked and at least one
event/chunk containing a partial generation). Modify the step implementations
used by "When I use \"responses\" to ask question with authorization header" and
the "Then The status code of the response is 200" verification to add a body
contract check for the non-streaming case and a header + chunk/event assertion
for the streaming case so the tests fail if the payload shape or streaming
behavior changes.

In `@tests/unit/utils/test_query.py`:
- Around line 410-411: Add a unit test in tests/unit/utils/test_query.py that
verifies persist_user_conversation_details does not update
UserConversation.last_response_id when the provided response_id begins with the
moderation prefix (e.g., "modr"); locate the test pattern used in
test_update_existing_conversation to mock an existing UserConversation (id,
user_id, last_response_id set to a sentinel), call
persist_user_conversation_details with response_id="modr_..." (or
"modr_blocked_123"), and assert the existing_conv.last_response_id remains the
original value; reference the function persist_user_conversation_details, the
UserConversation model, and the last_response_id field when adding this new test
case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d4e2a78b-c87f-4692-8aea-4b8ba8f7d4fd

📥 Commits

Reviewing files that changed from the base of the PR and between 36c2ae2 and 63592ab.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (35)
  • docs/openapi.json
  • docs/responses.md
  • pyproject.toml
  • src/app/endpoints/conversations_v1.py
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/app/routers.py
  • src/constants.py
  • src/models/context.py
  • src/models/database/conversations.py
  • src/models/requests.py
  • src/models/responses.py
  • src/utils/conversations.py
  • src/utils/endpoints.py
  • src/utils/query.py
  • src/utils/responses.py
  • src/utils/shields.py
  • src/utils/suid.py
  • src/utils/types.py
  • src/utils/vector_search.py
  • tests/e2e/features/info.feature
  • tests/e2e/features/responses.feature
  • tests/e2e/test_list.txt
  • tests/integration/test_openapi_json.py
  • tests/unit/app/endpoints/test_conversations.py
  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/app/test_routers.py
  • tests/unit/utils/test_conversations.py
  • tests/unit/utils/test_query.py
  • tests/unit/utils/test_responses.py
  • tests/unit/utils/test_shields.py
  • tests/unit/utils/test_vector_search.py

@asimurka asimurka force-pushed the conversation_management branch from 63592ab to f553d06 Compare March 10, 2026 15:42
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (5)
src/utils/conversations.py (1)

479-481: ⚠️ Potential issue | 🟠 Major

Preserve upstream APIStatusError details here.

Both helpers collapse every Llama Stack status failure into a generic 500. That turns upstream 404/401/403 cases into misleading internal errors and makes callers/debugging substantially harder. Please forward the upstream status/message instead of always returning InternalServerErrorResponse.generic().

Also applies to: 515-517

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/conversations.py` around lines 479 - 481, In the except
APIStatusError blocks (the handlers that now call
InternalServerErrorResponse.generic()), preserve and forward the upstream status
and message from the caught exception variable (e) into the HTTPException
instead of always returning a generic 500; replace the fixed
InternalServerErrorResponse.generic() usage with a response that uses
e.status_code (or e.status and/or e.status_code property available on
APIStatusError) and e.message (or str(e)) as the body/description, and apply the
same change to both occurrences around the catch at the APIStatusError handling
near the code shown and the duplicated block at lines 515-517 so callers receive
the original upstream HTTP status and message.
docs/responses.md (1)

106-106: ⚠️ Potential issue | 🟡 Minor

Spell out the shield_ids=[] contract.

Both sections describe the omitted case, but they still leave shield_ids: [] ambiguous. Clients need an explicit note that an empty list disables shields rather than inheriting the configured defaults.

Also applies to: 523-523

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/responses.md` at line 106, Update the `shield_ids` documentation entry
to explicitly state the difference between omitting the field and passing an
empty array: clarify that omitting `shield_ids` will apply all configured
shields (inherit defaults) while providing `shield_ids: []` disables all
shields, and apply this exact clarification to both occurrences of the
`shield_ids` table row so the contract is unambiguous; reference the
`shield_ids` symbol in the docs and use wording like "omit to inherit all
configured shields; explicitly set to [] to disable shields" for consistency.
src/app/endpoints/streaming_query.py (2)

254-258: ⚠️ Potential issue | 🟠 Major

Inline RAG chunks are still dropped on streaming requests.

This only threads inline_rag_context.referenced_documents into the generator. The BYOK/Solr chunks never make it into turn_summary.rag_chunks, so persisted streaming metadata diverges from the non-streaming path whenever inline RAG is used.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/streaming_query.py` around lines 254 - 258, The streaming
path only passes inline_rag_context.referenced_documents into
retrieve_response_generator so BYOK/Solr RAG chunks never make it into the
returned turn_summary.rag_chunks; update the streaming flow so all RAG chunks
are included: either pass a combined list
(inline_rag_context.referenced_documents plus the BYOK/Solr chunk lists) into
retrieve_response_generator (where generator, turn_summary are returned) or,
immediately after the call to retrieve_response_generator, merge the BYOK/Solr
chunks into turn_summary.rag_chunks. Refer to the symbols
retrieve_response_generator, inline_rag_context.referenced_documents, and
turn_summary.rag_chunks when making the change.

300-316: ⚠️ Potential issue | 🟠 Major

Blocked moderation still looks like a successful stream.

shield_violation_generator() finishes normally, so generate_response() sets stream_completed=True and runs the success-only tail afterwards. For new conversations that includes get_topic_summary(), which means blocked prompts can still trigger an extra model call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/streaming_query.py` around lines 300 - 316, The
blocked-moderation path currently returns shield_violation_generator which
completes normally, causing generate_response to treat the stream as successful;
change the flow so completion is treated as an aborted/failed stream: implement
a dedicated exception (e.g., ModerationBlocked) and update
shield_violation_generator to raise that exception after emitting the refusal
payload (or otherwise yield then raise) so generate_response sees a non-success
termination; reference shield_violation_generator, ModerationBlocked (new), and
generate_response to locate where to adjust behavior and where the exception
will be handled to skip success-only tail like get_topic_summary().
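One way to realize the suggested pattern; `ModerationBlocked` and the payload shape are illustrative, not the project's actual types:

```python
import asyncio

class ModerationBlocked(Exception):
    """Raised by the stream once the refusal has been emitted."""

async def shield_violation_generator(refusal_text):
    # Emit the refusal event to the client first...
    yield f"data: {refusal_text}\n\n"
    # ...then end with a non-success signal instead of a clean StopAsyncIteration.
    raise ModerationBlocked(refusal_text)

async def generate_response(stream):
    chunks, stream_completed = [], False
    try:
        async for chunk in stream:
            chunks.append(chunk)
        stream_completed = True  # success-only tail (e.g. topic summary) runs here
    except ModerationBlocked:
        pass  # skip get_topic_summary() and other success-only work
    return chunks, stream_completed

chunks, completed = asyncio.run(generate_response(shield_violation_generator("blocked")))
```

The refusal chunk still reaches the client, but `stream_completed` stays `False`, so the caller can cleanly skip the success-only tail.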
tests/unit/app/endpoints/test_streaming_query.py (1)

867-873: ⚠️ Potential issue | 🟡 Minor

Assert the blocked-turn append was actually awaited.

The test patches append_turn_items_to_conversation() but never checks the awaited call or the refusal_response payload. Capture the AsyncMock and use assert_awaited_once_with(...), otherwise a regression that drops the persistence step or passes the wrong refusal item still passes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/app/endpoints/test_streaming_query.py` around lines 867 - 873, The
test currently patches append_turn_items_to_conversation with an AsyncMock but
never asserts it was awaited or that the refusal payload was passed; change the
patch to capture the AsyncMock (e.g., mock_append = mocker.AsyncMock() and patch
append_turn_items_to_conversation to new=mock_append) and after invoking the
code assert the mock was awaited exactly once with the expected conversation
identifier/context and the moderation refusal payload (use
mock_moderation_result.refusal_response and the test's conversation/context
object) via mock_append.assert_awaited_once_with(...).
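The assertion shape being asked for, shown on a minimal stand-in flow (identifiers like the conversation ID and refusal payload are illustrative):

```python
import asyncio
from unittest.mock import AsyncMock

mock_append = AsyncMock()

async def blocked_flow(append_turn_items_to_conversation, conversation_id, refusal):
    # The blocked path must persist the refusal turn before returning.
    await append_turn_items_to_conversation(conversation_id, refusal)

asyncio.run(blocked_flow(mock_append, "conv-123", "refusal-payload"))

# Fails if the persistence step is dropped or called with the wrong payload.
mock_append.assert_awaited_once_with("conv-123", "refusal-payload")
```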
🧹 Nitpick comments (1)
src/utils/endpoints.py (1)

245-253: Don’t return the parent UserConversation for a forked conversation.

This branch creates a brand-new conversation ID, but the returned ResponsesConversationContext still carries the source conversation row. That violates the context model’s “None for new ones” contract and leaves downstream consumers guessing whether user_conversation refers to the new thread or the parent. Return None here, or add a separate source_conversation field if the parent row is still needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/endpoints.py` around lines 245 - 253, The branch that forks a
conversation returns a new ID from create_new_conversation but still sets
user_conversation in the ResponsesConversationContext, violating the contract
that new conversations must have user_conversation = None; update the return to
set user_conversation to None (and keep generate_topic_summary logic using
want_topic_summary) or alternatively add a new field source_conversation to
ResponsesConversationContext and populate that with the original
user_conversation while setting user_conversation to None; modify the
instantiation of ResponsesConversationContext (the return block) accordingly so
downstream code can unambiguously tell this is a new forked conversation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/app/endpoints/query.py`:
- Around line 198-212: The post-response processing currently only gates the
inline-RAG merge on moderation_result.decision == "passed" but lets blocked
turns continue into the topic-summary/success-only tail (causing extra model
calls on refused content); change the handler so that all post-response work
after await retrieve_response(...) — including merging
inline_rag_context.rag_chunks and turn_summary.rag_chunks, combining
referenced_documents via deduplicate_referenced_documents, and any subsequent
topic-summary logic that uses turn_summary — is executed only when
moderation_result.decision == "passed" (i.e., wrap the entire post-response
block that touches inline_rag_context, turn_summary, tool_rag_chunks,
tool_rag_documents, and downstream topic-summary functions inside the same
if-check).

In `@src/models/requests.py`:
- Around line 739-745: The validator check_previous_response_id currently
rejects any previous_response_id starting with "modr", which breaks follow-ups
when responses are emitted with id=moderation_result.moderation_id (the code
that sets response.id to moderation_result.moderation_id). Fix this by removing
or relaxing the startswith("modr") rejection in check_previous_response_id so
"modr"-prefixed IDs are accepted, and ensure downstream conversation-state
resolution logic (where previous_response_id is consumed) maps moderation IDs to
the real conversation entries; alternatively, stop emitting
moderation_result.moderation_id as the public response.id and assign a different
stable response id instead (adjust the response construction that sets
response.id accordingly).

In `@src/utils/types.py`:
- Around line 277-278: The current unconditional result.pop("conversation",
None) removes the conversation for forked requests; update
resolve_response_context() so it only drops "conversation" when the
previous_response_id is a linear/tip continuation. Specifically, determine
whether the previous_response_id refers to the tip (e.g., use or expose the
resolved context flag that indicates a tip/linear continuation from
resolve_response_context or check a property like previous_response_is_tip) and
only call result.pop("conversation", None) in that case so forks keep the
conversation and downstream responses.create() can attach to the new
conversation.

---

Duplicate comments:
In `@docs/responses.md`:
- Line 106: Update the `shield_ids` documentation entry to explicitly state the
difference between omitting the field and passing an empty array: clarify that
omitting `shield_ids` will apply all configured shields (inherit defaults) while
providing `shield_ids: []` disables all shields, and apply this exact
clarification to both occurrences of the `shield_ids` table row so the contract
is unambiguous; reference the `shield_ids` symbol in the docs and use wording
like "omit to inherit all configured shields; explicitly set to [] to disable
shields" for consistency.

In `@src/app/endpoints/streaming_query.py`:
- Around line 254-258: The streaming path only passes
inline_rag_context.referenced_documents into retrieve_response_generator so
BYOK/Solr RAG chunks never make it into the returned turn_summary.rag_chunks;
update the streaming flow so all RAG chunks are included: either pass a combined
list (inline_rag_context.referenced_documents plus the BYOK/Solr chunk lists)
into retrieve_response_generator (where generator, turn_summary are returned)
or, immediately after the call to retrieve_response_generator, merge the
BYOK/Solr chunks into turn_summary.rag_chunks. Refer to the symbols
retrieve_response_generator, inline_rag_context.referenced_documents, and
turn_summary.rag_chunks when making the change.
- Around line 300-316: The blocked-moderation path currently returns
shield_violation_generator which completes normally, causing generate_response
to treat the stream as successful; change the flow so completion is treated as
an aborted/failed stream: implement a dedicated exception (e.g.,
ModerationBlocked) and update shield_violation_generator to raise that exception
after emitting the refusal payload (or otherwise yield then raise) so
generate_response sees a non-success termination; reference
shield_violation_generator, ModerationBlocked (new), and generate_response to
locate where to adjust behavior and where the exception will be handled to skip
success-only tail like get_topic_summary().

In `@src/utils/conversations.py`:
- Around line 479-481: In the except APIStatusError blocks (the handlers that
now call InternalServerErrorResponse.generic()), preserve and forward the
upstream status and message from the caught exception variable (e) into the
HTTPException instead of always returning a generic 500; replace the fixed
InternalServerErrorResponse.generic() usage with a response that uses
e.status_code (or e.status and/or e.status_code property available on
APIStatusError) and e.message (or str(e)) as the body/description, and apply the
same change to both occurrences around the catch at the APIStatusError handling
near the code shown and the duplicated block at lines 515-517 so callers receive
the original upstream HTTP status and message.

In `@tests/unit/app/endpoints/test_streaming_query.py`:
- Around line 867-873: The test currently patches
append_turn_items_to_conversation with an AsyncMock but never asserts it was
awaited or that the refusal payload was passed; change the patch to capture the
AsyncMock (e.g., mock_append = mocker.AsyncMock() and patch
append_turn_items_to_conversation to new=mock_append) and after invoking the
code assert the mock was awaited exactly once with the expected conversation
identifier/context and the moderation refusal payload (use
mock_moderation_result.refusal_response and the test's conversation/context
object) via mock_append.assert_awaited_once_with(...).

---

Nitpick comments:
In `@src/utils/endpoints.py`:
- Around line 245-253: The branch that forks a conversation returns a new ID
from create_new_conversation but still sets user_conversation in the
ResponsesConversationContext, violating the contract that new conversations must
have user_conversation = None; update the return to set user_conversation to
None (and keep generate_topic_summary logic using want_topic_summary) or
alternatively add a new field source_conversation to
ResponsesConversationContext and populate that with the original
user_conversation while setting user_conversation to None; modify the
instantiation of ResponsesConversationContext (the return block) accordingly so
downstream code can unambiguously tell this is a new forked conversation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ff8e977c-7ba5-44b0-b4ce-7e95796adfec

📥 Commits

Reviewing files that changed from the base of the PR and between 63592ab and f553d06.

📒 Files selected for processing (13)
  • docs/openapi.json
  • docs/responses.md
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/models/requests.py
  • src/models/responses.py
  • src/utils/conversations.py
  • src/utils/endpoints.py
  • src/utils/types.py
  • tests/unit/app/endpoints/test_conversations.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/app/endpoints/responses.py

@asimurka asimurka force-pushed the conversation_management branch from f553d06 to aa6c416 Compare March 10, 2026 16:14
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/utils/shields.py (1)

137-143: ⚠️ Potential issue | 🟠 Major

Return early when no shields are selected.

get_shields_for_request() can legitimately return [], but client.models.list() still runs unconditionally. That means a request can still fail on model discovery even when moderation is effectively disabled, and any exception from models.list() bypasses the local error mapping.

💡 Minimal guard
     shields_to_run = await get_shields_for_request(client, shield_ids)
+    if not shields_to_run:
+        return ShieldModerationPassed()
+
     available_models = {model.id for model in await client.models.list()}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/shields.py` around lines 137 - 143, If get_shields_for_request(...)
returns an empty list, avoid calling client.models.list() unnecessarily; add an
early return/check after shields_to_run = await get_shields_for_request(client,
shield_ids) (or skip the model discovery) so that when shields_to_run is empty
you do not compute available_models or run client.models.list(), and only
compute available_models and perform the provider_resource_id checks for shields
when shields_to_run contains items (referencing shields_to_run,
available_models, client.models.list, shield.provider_id and
shield.provider_resource_id).
♻️ Duplicate comments (6)
src/utils/conversations.py (2)

468-481: ⚠️ Potential issue | 🟠 Major

Preserve upstream APIStatusError details when appending items.

conversations.items.create() can fail with non-500 statuses, and flattening every status error to InternalServerErrorResponse.generic() gives callers the wrong failure mode for missing or forbidden conversations. Bubble the upstream status/detail instead of hard-coding 500 here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/conversations.py` around lines 468 - 481, The current except block
for APIStatusError in the conversations.items.create call masks upstream status
and details by returning InternalServerErrorResponse.generic(); instead extract
the actual status and message from the APIStatusError (e.g., use e.status_code
or e.response.status_code and e.response.text / str(e) as available) and raise
HTTPException with the preserved status_code and detail (and any relevant
headers) so callers receive the original upstream error (keep the existing
APIConnectionError handling unchanged).
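A minimal sketch of the shape this fix suggests: map the upstream error to an HTTP error with the same status instead of a generic 500. The stub classes below only mirror the shape of the real llama-stack `APIStatusError` and FastAPI `HTTPException`; the attribute names are assumptions, not the project's exact types.

```python
# Stubs standing in for the real client/framework exception types (assumed shapes).
class APIStatusError(Exception):
    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(message)
        self.status_code = status_code
        self.message = message


class HTTPException(Exception):
    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def to_http_exception(err: APIStatusError) -> HTTPException:
    # Preserve the upstream status and detail so a 404/403 stays a 404/403.
    return HTTPException(status_code=err.status_code, detail=err.message)
```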

497-517: ⚠️ Potential issue | 🟠 Major

Don't translate paginated conversation read failures to a generic 500.

Both items.list() and page.get_next_page() can raise APIStatusError. Mapping all of them to InternalServerErrorResponse.generic() means /conversations/{id} returns 500 even when Llama Stack is explicitly saying "not found" or "forbidden."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/conversations.py` around lines 497 - 517, The current except
APIStatusError block always returns InternalServerErrorResponse.generic(), which
hides Llama Stack errors like 404/403 from client.conversations.items.list and
page.get_next_page; modify the APIStatusError handler to inspect the exception
(APIStatusError) for the upstream status/code and message (e.g., e.status or
e.response.status_code and e.message or e.response.text), and raise
HTTPException with that status_code and a payload containing the upstream error
details (falling back to InternalServerErrorResponse.generic() only if no status
can be determined); keep the existing APIConnectionError handling and reuse
ServiceUnavailableResponse where appropriate.
src/models/requests.py (1)

739-745: ⚠️ Potential issue | 🟠 Major

Don't reject moderation response IDs that /responses already exposes.

Blocked /responses replies are built with id=moderation_result.moderation_id in src/app/endpoints/responses.py, so this validator makes the normal previous_response_id=response.id follow-up fail immediately after a moderated turn. If this special case stays, route it through the shared moderation-ID helper instead of another hard-coded prefix check.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/requests.py` around lines 739 - 745, The validator
check_previous_response_id currently rejects any previous_response_id that
startswith("modr") which blocks legitimate IDs created by
src/app/endpoints/responses.py (moderation_result.moderation_id); remove the
hard-coded prefix check and instead call the shared moderation-ID helper used by
the responses endpoint (e.g., the project’s
is_moderation_id/normalize_moderation_id helper) to determine if the value is a
moderation ID and only reject when the shared helper indicates it’s an
invalid/forbidden moderation ID; update check_previous_response_id to rely on
that helper so responses-created moderation IDs are allowed.
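A sketch of the shared helper the comment asks for: one function that owns the moderation-ID prefix, so the validator and the responses endpoint cannot drift apart. The `modr` prefix is taken from the review text; the helper name is hypothetical.

```python
# Single source of truth for the moderation-ID convention (prefix assumed
# from the review comment; helper name is illustrative).
MODERATION_ID_PREFIX = "modr"


def is_moderation_id(value: str) -> bool:
    """Return True when the ID was minted by the moderation path."""
    return value.startswith(MODERATION_ID_PREFIX)
```

With this in place, `check_previous_response_id` can consult `is_moderation_id()` and apply whatever policy is wanted in one spot, rather than repeating a hard-coded `startswith("modr")` check.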
src/app/endpoints/streaming_query.py (1)

300-316: ⚠️ Potential issue | 🟠 Major

Blocked streams still fall into the success epilogue.

Returning shield_violation_generator() here makes generate_response() treat the stream as completed successfully, so blocked turns still run the post-stream success path (get_topic_summary(), quota/end events, and persistence) instead of short-circuiting as a refused request.

tests/unit/app/endpoints/test_streaming_query.py (1)

870-877: ⚠️ Potential issue | 🟡 Minor

Assert the blocked-path persistence call.

This test patches append_turn_items_to_conversation() but never verifies that it was awaited with the refusal item, so the blocked-flow persistence wiring can regress without failing the suite.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/app/endpoints/test_streaming_query.py` around lines 870 - 877, The
test patches append_turn_items_to_conversation but never asserts it was awaited
with the refusal item; after calling retrieve_response_generator and receiving
turn_summary, construct the expected refusal item from turn_summary (e.g., the
refusal text/role or the item that represents the refusal in turn_summary) and
add an assertion that the AsyncMock was awaited with the conversation and that
refusal item, e.g.
append_turn_items_to_conversation.assert_awaited_once_with(mock_context["conversation"],
[expected_refusal_item]); this ensures the blocked-path persistence wiring
(append_turn_items_to_conversation + retrieve_response_generator) is exercised.
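The missing assertion can be sketched with `unittest.mock.AsyncMock` directly; the conversation ID and refusal item below are illustrative stand-ins for the test's real fixtures.

```python
import asyncio
from unittest.mock import AsyncMock


async def exercise_blocked_path() -> AsyncMock:
    append_mock = AsyncMock()
    conversation = "conv-1"  # stand-in for mock_context["conversation"]
    refusal_item = {"role": "assistant", "content": "Content blocked"}
    # Stand-in for the code under test persisting the refusal turn.
    await append_mock(conversation, [refusal_item])
    # The assertion the review asks for: verify the await and its arguments.
    append_mock.assert_awaited_once_with(conversation, [refusal_item])
    return append_mock


mock = asyncio.run(exercise_blocked_path())
```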
src/app/endpoints/query.py (1)

198-221: ⚠️ Potential issue | 🟠 Major

Blocked turns still fall through to topic summary generation.

When moderation_result.decision == "blocked", the RAG merge is correctly skipped, but the topic summary generation at lines 214-221 is not guarded. This causes an unnecessary LLM call (get_topic_summary) on content that was just blocked by moderation.

Consider wrapping the topic summary block with the same moderation check:

Possible fix
     # Get topic summary for new conversation
-    if not user_conversation and query_request.generate_topic_summary:
+    if (
+        moderation_result.decision == "passed"
+        and not user_conversation
+        and query_request.generate_topic_summary
+    ):
         logger.debug("Generating topic summary for new conversation")
         topic_summary = await get_topic_summary(
             query_request.query, client, responses_params.model
         )
     else:
         topic_summary = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/query.py` around lines 198 - 221, The topic-summary call is
executed even when moderation_result.decision == "blocked"; wrap the existing
topic summary generation (the block that checks user_conversation and
query_request.generate_topic_summary and calls get_topic_summary) with the same
moderation check used for RAG merging (i.e., ensure moderation_result.decision
== "passed" before calling get_topic_summary), so that
get_topic_summary(query_request.query, client, responses_params.model) only runs
for passed content and set topic_summary = None otherwise.
🧹 Nitpick comments (1)
src/utils/endpoints.py (1)

229-235: Collapse the previous_response_id existence check into a single query.

check_turn_existence() and retrieve_turn_by_response_id() hit the same row back-to-back. That adds an extra query and leaves a race window between the check and the read. Call retrieve_turn_by_response_id() directly and let its 404 path handle the missing case.

♻️ Suggested simplification
-        if not check_turn_existence(previous_response_id):
-            error_response = NotFoundResponse(
-                resource="response", resource_id=previous_response_id
-            )
-            raise HTTPException(**error_response.model_dump())
         prev_user_turn = retrieve_turn_by_response_id(previous_response_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/endpoints.py` around lines 229 - 235, Collapse the two database
calls into one: remove the pre-check using check_turn_existence() and call
retrieve_turn_by_response_id(previous_response_id) directly; if
retrieve_turn_by_response_id returns None or raises when not found, convert that
into the same NotFoundResponse/HTTPException you previously built
(resource="response", resource_id=previous_response_id) so a single query both
fetches the turn and handles the 404 path without the extra race window.
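The single-query shape can be sketched as below, modeling `retrieve_turn_by_response_id()` as returning `None` for a missing row; names are illustrative, not the project's exact signatures.

```python
from typing import Callable, Optional


def get_previous_turn(
    previous_response_id: str,
    retrieve_turn_by_response_id: Callable[[str], Optional[dict]],
) -> dict:
    # One lookup both fetches the row and produces the 404 path,
    # removing the check-then-read race window.
    turn = retrieve_turn_by_response_id(previous_response_id)
    if turn is None:
        raise LookupError(f"response {previous_response_id} not found")
    return turn
```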
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/responses.md`:
- Line 72: TOC links pointing to old anchors are broken after renaming the
"Inherited LLS OpenAPI Attributes" heading; update the table-of-contents entries
(and any other links) to match the new Markdown anchor (e.g.,
inherited-lls-openapi-attributes) so they point to the renamed heading text, and
repeat the same fix for the other occurrence mentioned (the second "Inherited
LLS OpenAPI Attributes" link).

In `@src/app/endpoints/responses.py`:
- Around line 576-579: The code currently calls get_topic_summary() whenever
generate_topic_summary is true even for turns that were blocked by moderation;
change the logic so topic summary generation is skipped if the moderation result
indicates the turn was refused/blocked. Concretely, guard the
get_topic_summary() calls (the ones at the generate_topic_summary sites) with an
additional condition that the moderation result is not blocked/refused (e.g.,
check the existing moderation flag/variable such as shield_blocked,
moderation_result.refused, or similar) so you only call
get_topic_summary(input_text, client, api_params.model) when
generate_topic_summary is true AND the moderation check passed; apply this
change to both occurrences referenced in the diff.
- Around line 225-247: The handler builds inline_rag_context via
build_rag_context but only forwards referenced_documents; modify the call that
selects and invokes handle_streaming_response / handle_non_streaming_response to
also pass inline_rag_context.context_text (e.g., as a parameter named
static_rag_context or inline_rag_context_text) so downstream code (and
ultimately client.responses.create) receives the RAG context text; update the
parameter list in handle_streaming_response and handle_non_streaming_response
(and their usages) to accept and forward that context_text into
client.responses.create().

In `@src/utils/shields.py`:
- Around line 255-256: The code currently treats an explicit empty override
(shield_ids == []) as "run none" by returning [], which bypasses moderation;
instead, detect an explicit empty list and reject it by raising an HTTP 422
(Unprocessable Entity) error consistent with QueryRequest.shield_ids. Locate the
check that inspects shield_ids (the `if shield_ids == []:` branch) and replace
the return with raising an appropriate 422 error (include a clear detail message
referencing shield_ids) so callers receive validation failure rather than an
empty-shields bypass.

---

Outside diff comments:
In `@src/utils/shields.py`:
- Around line 137-143: If get_shields_for_request(...) returns an empty list,
avoid calling client.models.list() unnecessarily; add an early return/check
after shields_to_run = await get_shields_for_request(client, shield_ids) (or
skip the model discovery) so that when shields_to_run is empty you do not
compute available_models or run client.models.list(), and only compute
available_models and perform the provider_resource_id checks for shields when
shields_to_run contains items (referencing shields_to_run, available_models,
client.models.list, shield.provider_id and shield.provider_resource_id).

---

Duplicate comments:
In `@src/app/endpoints/query.py`:
- Around line 198-221: The topic-summary call is executed even when
moderation_result.decision == "blocked"; wrap the existing topic summary
generation (the block that checks user_conversation and
query_request.generate_topic_summary and calls get_topic_summary) with the same
moderation check used for RAG merging (i.e., ensure moderation_result.decision
== "passed" before calling get_topic_summary), so that
get_topic_summary(query_request.query, client, responses_params.model) only runs
for passed content and set topic_summary = None otherwise.

In `@src/models/requests.py`:
- Around line 739-745: The validator check_previous_response_id currently
rejects any previous_response_id that startswith("modr") which blocks legitimate
IDs created by src/app/endpoints/responses.py (moderation_result.moderation_id);
remove the hard-coded prefix check and instead call the shared moderation-ID
helper used by the responses endpoint (e.g., the project’s
is_moderation_id/normalize_moderation_id helper) to determine if the value is a
moderation ID and only reject when the shared helper indicates it’s an
invalid/forbidden moderation ID; update check_previous_response_id to rely on
that helper so responses-created moderation IDs are allowed.

In `@src/utils/conversations.py`:
- Around line 468-481: The current except block for APIStatusError in the
conversations.items.create call masks upstream status and details by returning
InternalServerErrorResponse.generic(); instead extract the actual status and
message from the APIStatusError (e.g., use e.status_code or
e.response.status_code and e.response.text / str(e) as available) and raise
HTTPException with the preserved status_code and detail (and any relevant
headers) so callers receive the original upstream error (keep the existing
APIConnectionError handling unchanged).
- Around line 497-517: The current except APIStatusError block always returns
InternalServerErrorResponse.generic(), which hides Llama Stack errors like
404/403 from client.conversations.items.list and page.get_next_page; modify the
APIStatusError handler to inspect the exception (APIStatusError) for the
upstream status/code and message (e.g., e.status or e.response.status_code and
e.message or e.response.text), and raise HTTPException with that status_code and
a payload containing the upstream error details (falling back to
InternalServerErrorResponse.generic() only if no status can be determined); keep
the existing APIConnectionError handling and reuse ServiceUnavailableResponse
where appropriate.

In `@tests/unit/app/endpoints/test_streaming_query.py`:
- Around line 870-877: The test patches append_turn_items_to_conversation but
never asserts it was awaited with the refusal item; after calling
retrieve_response_generator and receiving turn_summary, construct the expected
refusal item from turn_summary (e.g., the refusal text/role or the item that
represents the refusal in turn_summary) and add an assertion that the AsyncMock
was awaited with the conversation and that refusal item, e.g.
append_turn_items_to_conversation.assert_awaited_once_with(mock_context["conversation"],
[expected_refusal_item]); this ensures the blocked-path persistence wiring
(append_turn_items_to_conversation + retrieve_response_generator) is exercised.

---

Nitpick comments:
In `@src/utils/endpoints.py`:
- Around line 229-235: Collapse the two database calls into one: remove the
pre-check using check_turn_existence() and call
retrieve_turn_by_response_id(previous_response_id) directly; if
retrieve_turn_by_response_id returns None or raises when not found, convert that
into the same NotFoundResponse/HTTPException you previously built
(resource="response", resource_id=previous_response_id) so a single query both
fetches the turn and handles the 404 path without the extra race window.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 99307210-19f4-4090-a63b-be8ed66f6f33

📥 Commits

Reviewing files that changed from the base of the PR and between f553d06 and aa6c416.

📒 Files selected for processing (14)
  • docs/openapi.json
  • docs/responses.md
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/models/requests.py
  • src/models/responses.py
  • src/utils/conversations.py
  • src/utils/endpoints.py
  • src/utils/shields.py
  • src/utils/types.py
  • tests/unit/app/endpoints/test_conversations.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/utils/types.py

@asimurka asimurka marked this pull request as draft March 10, 2026 18:20
return ""

message_item = cast(ResponseMessage, response_item)
if message_item.role == "user":
Contributor Author

The function can be used for both input and output items.

@asimurka asimurka force-pushed the conversation_management branch 2 times, most recently from 1f4c764 to 95e12bc Compare March 11, 2026 09:39
@asimurka asimurka marked this pull request as ready for review March 11, 2026 09:40
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (9)
docs/responses.md (1)

12-12: ⚠️ Potential issue | 🟡 Minor

Fix the TOC anchor to match the renamed heading.

The TOC entry at line 12 links to #inherited-lls-openapi-fields, but the heading at line 72 is now "Inherited LLS OpenAPI Attributes". Markdown will generate the anchor #inherited-lls-openapi-attributes, so this link is broken.

🔗 Proposed fix
-  * [Inherited LLS OpenAPI Fields](#inherited-lls-openapi-fields)
+  * [Inherited LLS OpenAPI Attributes](#inherited-lls-openapi-attributes)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/responses.md` at line 12, TOC link target is out of sync with the
renamed heading: update the TOC entry that currently links to
"#inherited-lls-openapi-fields" to the correct anchor
"#inherited-lls-openapi-attributes" (or revert the heading to "Inherited LLS
OpenAPI Fields") so the link matches the heading "Inherited LLS OpenAPI
Attributes"; locate the bullet "* [Inherited LLS OpenAPI
Fields](`#inherited-lls-openapi-fields`)" and change the anchor to
"#inherited-lls-openapi-attributes".
src/app/endpoints/query.py (1)

212-217: ⚠️ Potential issue | 🟠 Major

Skip topic-summary generation for blocked turns.

When moderation returns a refusal, this block still makes a second model call for the new-conversation topic summary. Guard it with moderation_result.decision == "passed" so blocked prompts never re-enter the LLM path.

Suggested fix
-    if not user_conversation and query_request.generate_topic_summary:
+    if (
+        moderation_result.decision == "passed"
+        and not user_conversation
+        and query_request.generate_topic_summary
+    ):
         logger.debug("Generating topic summary for new conversation")
         topic_summary = await get_topic_summary(
             query_request.query, client, responses_params.model
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/query.py` around lines 212 - 217, The topic-summary
generation currently runs for new conversations unconditionally; change the
guard so that you only call get_topic_summary(query_request.query, client,
responses_params.model) when this is a new conversation (not user_conversation),
query_request.generate_topic_summary is true, AND the moderation result
indicates the prompt passed (i.e., moderation_result.decision == "passed");
update the if condition that surrounds get_topic_summary to include the
moderation_result.decision check so blocked/refused turns never trigger the
second LLM call.
src/utils/shields.py (1)

255-256: ⚠️ Potential issue | 🟠 Major

Reject shield_ids=[] instead of disabling moderation.

Returning [] here turns an explicit empty override into a full moderation bypass. This should raise a 422 so callers cannot opt out of shields by sending an empty list.

Suggested fix
-    if shield_ids == []:
-        return []
+    if shield_ids == []:
+        response = UnprocessableEntityResponse(
+            response="Invalid shield_ids",
+            cause=(
+                "shield_ids must be omitted or contain at least one shield identifier."
+            ),
+        )
+        raise HTTPException(**response.model_dump())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/shields.py` around lines 255 - 256, The code currently treats an
explicit empty override list as a signal to disable moderation by returning []
when shield_ids == []; change this to reject an explicit empty list with a 422
error instead. In the function that validates/handles the shield_ids parameter
(the block containing the check "if shield_ids == []:"), replace the return []
with raising an HTTP 422 (e.g., raise HTTPException(status_code=422,
detail="shield_ids cannot be an empty list") or the project's equivalent) so
callers cannot opt out of shields by sending an empty list; keep None or missing
values allowed if the rest of the code expects them.
src/app/endpoints/streaming_query.py (2)

254-258: ⚠️ Potential issue | 🟠 Major

Carry inline RAG chunks through the streaming path too.

Only inline_rag_docs are threaded through here. Unlike the non-streaming query path, turn_summary.rag_chunks never receives inline_rag_context.rag_chunks, so stored streaming results lose all pre-query RAG chunks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/streaming_query.py` around lines 254 - 258, The streaming
path calls retrieve_response_generator but only threads
inline_rag_context.referenced_documents as inline_rag_docs, so
turn_summary.rag_chunks never gets inline_rag_context.rag_chunks and stored
streaming results lose pre-query RAG chunks; fix by ensuring the streaming call
carries the rag_chunks through—either pass inline_rag_context.rag_chunks into
retrieve_response_generator (add an inline_rag_chunks/rag_chunks param) or after
the call assign turn_summary.rag_chunks = inline_rag_context.rag_chunks so
turn_summary contains the same rag chunks as the non-streaming path (update the
call site around retrieve_response_generator and the handling of turn_summary to
include inline_rag_context.rag_chunks).

300-315: ⚠️ Potential issue | 🟠 Major

Blocked streams still go through the success tail.

This branch returns a normal generator, so generate_response() marks the stream as completed and still runs success-only post-processing, including topic summary generation for new conversations. Return a blocked/failure signal alongside the generator so the success path is skipped.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/streaming_query.py` around lines 300 - 315, The blocked
moderation branch currently returns only the generator and turn_summary
(shield_violation_generator, turn_summary) so generate_response treats it as
success; change this return to include an explicit failure signal (e.g., return
(shield_violation_generator(...), turn_summary, False) or a Failure enum) and
update generate_response to check that third value and skip success-only
post-processing (topic summary, etc.) when it indicates failure; adjust callers
to destructure the new triple where necessary.
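The suggested triple return can be sketched as follows: the generator is paired with an explicit flag so the caller can skip the success-only epilogue for blocked turns. Function names and the flag shape are illustrative, not the project's exact signatures.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Tuple


async def shield_violation_generator() -> AsyncIterator[str]:
    # Emits only the refusal message for a blocked turn.
    yield "Content blocked"


def blocked_result(turn_summary: Dict) -> Tuple[AsyncIterator[str], Dict, bool]:
    # The third value (False) marks the stream as refused, not completed.
    return shield_violation_generator(), turn_summary, False


async def generate_response(gen: AsyncIterator[str], completed: bool) -> List[str]:
    events = [chunk async for chunk in gen]
    if completed:
        # Success-only post-processing (topic summary, quota/end events, ...)
        events.append("topic-summary")
    return events
```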
src/utils/types.py (1)

258-279: ⚠️ Potential issue | 🟠 Major

Separate previous_response_id handling from the MCP tool rewrite.

Because this branch sits behind the early returns above it, linear continuations without tools keep both conversation and previous_response_id. And when tools are present, forked continuations still lose the fresh conversation unconditionally. Please move the conversation stripping out of the tool rehydration block and gate it only on the resolved tip/linear case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/types.py` around lines 258 - 279, The conversation popping logic
currently sits inside the tool rehydration block in model_dump; move it out and
only strip "conversation" for linear continuations (i.e., when there are no
tools). Concretely, remove the if self.previous_response_id:
result.pop("conversation", None) from inside the loop/after tool-rewrite, and
instead add a guarded pop like if self.previous_response_id and not self.tools:
result.pop("conversation", None) placed after you compute result and
dumped_tools but before returning early for tool handling or at the end of the
function so it executes only for the resolved tip/linear case; keep the existing
tool rehydration (tools/dumped_tools loop) unchanged.
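The guarded pop the prompt describes can be sketched in isolation; field names follow the review text and the function is a hypothetical stand-in for the tail of `model_dump`.

```python
from typing import List, Optional


def finalize_dump(
    result: dict, previous_response_id: Optional[str], tools: List[dict]
) -> dict:
    # Strip "conversation" only for the linear continuation case
    # (previous_response_id set, no tools), outside the tool rehydration loop.
    if previous_response_id and not tools:
        result.pop("conversation", None)
    return result
```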
src/app/endpoints/responses.py (2)

692-695: ⚠️ Potential issue | 🟡 Minor

Topic summary generation should be skipped for shield-blocked responses.

Similar to the streaming path, when moderation_result.decision == "blocked", the code still generates a topic summary if request.generate_topic_summary is True. This makes an unnecessary LLM call for refused content.

🛡️ Proposed fix
     # Get topic summary for new conversation
     topic_summary = None
-    if request.generate_topic_summary:
+    if request.generate_topic_summary and moderation_result.decision != "blocked":
         logger.debug("Generating topic summary for new conversation")
         topic_summary = await get_topic_summary(input_text, client, api_params.model)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/responses.py` around lines 692 - 695, When
moderation_result.decision == "blocked" we should skip generating a topic
summary to avoid an unnecessary LLM call; update the logic around
request.generate_topic_summary and the call to get_topic_summary so you only
call get_topic_summary(input_text, client, api_params.model) when
request.generate_topic_summary is true AND moderation_result.decision !=
"blocked", and ensure topic_summary stays None for blocked responses (the
variable topic_summary and the logging call remain but the await
get_topic_summary must be gated by the moderation_result check).

284-296: ⚠️ Potential issue | 🟡 Minor

Verify topic summary generation is skipped for shield-blocked requests.

When moderation blocks, turn_summary.llm_response is set to the refusal message at line 286, but generate_topic_summary is still passed to generate_response (line 341). This could generate a topic summary for blocked content, which wastes an LLM call and may produce inappropriate summaries.

Consider passing generate_topic_summary=False when moderation is blocked, or adding a guard in generate_response.

🛡️ Proposed fix
+    is_blocked = moderation_result.decision == "blocked"
     return StreamingResponse(
         generate_response(
             generator=generator,
             turn_summary=turn_summary,
             client=client,
             auth=auth,
             input_text=input_text,
             started_at=started_at,
             api_params=api_params,
-            generate_topic_summary=request.generate_topic_summary or False,
+            generate_topic_summary=(request.generate_topic_summary or False) and not is_blocked,
         ),
         media_type="text/event-stream",
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/responses.py` around lines 284 - 296, The topic-summary
generator should be disabled for shield-blocked requests: detect when
moderation_result.decision == "blocked" (where turn_summary.llm_response is set
and shield_violation_generator is created) and ensure generate_response is
invoked with generate_topic_summary=False (or add a guard inside
generate_response to skip summary generation when moderation_result indicates a
block); update the call site that currently passes through
generate_topic_summary to respect this blocked state so no topic-summary LLM
call is made for shielded requests.
tests/unit/app/endpoints/test_streaming_query.py (1)

864-880: ⚠️ Potential issue | 🟡 Minor

Test does not verify blocked path persistence.

The test patches append_turn_items_to_conversation but never asserts it was called. This means a regression in the blocked-flow persistence would go undetected.

💚 Proposed fix to verify persistence
+        append_mock = mocker.patch(
+            "app.endpoints.streaming_query.append_turn_items_to_conversation",
+            new=mocker.AsyncMock(),
+        )
-        mocker.patch(
-            "app.endpoints.streaming_query.append_turn_items_to_conversation",
-            new=mocker.AsyncMock(),
-        )

         _generator, turn_summary = await retrieve_response_generator(
             mock_responses_params, mock_context, []
         )

         assert isinstance(turn_summary, TurnSummary)
         assert turn_summary.llm_response == "Content blocked"
+        
+        # Verify persistence was called with refusal response
+        append_mock.assert_awaited_once()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/app/endpoints/test_streaming_query.py` around lines 864 - 880, The
test currently patches append_turn_items_to_conversation but never asserts it
was invoked; after calling retrieve_response_generator with mock_context (whose
moderation_result.decision == "blocked"), add an assertion that the patched
append_turn_items_to_conversation AsyncMock was awaited (e.g.,
assert_called_once or assert_awaited_with) and verify it was passed the expected
elements such as mock_context (or its conversation) and
mock_context.moderation_result.refusal_response so the blocked-flow persistence
is actually tested alongside the existing TurnSummary and llm_response
assertions.
🧹 Nitpick comments (2)
docs/responses.md (2)

106-106: Document the behavior when shield_ids is provided as an empty list.

The documentation states what happens when shield_ids is omitted (all configured shields are used), but it doesn't specify what happens when an empty list is explicitly provided. Based on the implementation context, an empty list means "run no shields".

📝 Suggested clarification
-| `shield_ids` | array[string] | Shield IDs to apply. If omitted, all configured shields in LCORE are used | No |
+| `shield_ids` | array[string] | Shield IDs to apply. If omitted, all configured shields in LCORE are used. If provided as an empty list, no shields are run | No |

Also update line 523 in the LCORE-Specific Extensions section under Behavioral Differences to match.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/responses.md` at line 106, Update the docs to explicitly state that when
the `shield_ids` parameter is provided as an empty array it means "run no
shields" (distinct from omitting it which runs all configured shields), and
mirror that clarification in the LCORE-Specific Extensions -> Behavioral
Differences section so both places consistently document the empty-list behavior
for `shield_ids`.

388-388: Clarify the id field description.

The description "Unique response ID or moderation ID" is ambiguous. Does the field contain a response ID in all cases, or does it vary based on whether the request was moderated? Consider clarifying whether this is always a response ID or if the type changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/responses.md` at line 388, Update the `id` field description to clarify
when it contains a response ID versus a moderation ID: replace "Unique response
ID or moderation ID" with a precise statement such as "Unique identifier for the
returned item: the response ID for normal responses, or the moderation ID when
the request was moderated (string)"; ensure the `id` entry in docs/responses.md
explicitly states that the value varies based on moderation and that the type is
always string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/utils/vector_search.py`:
- Around line 639-672: The function append_inline_rag_context_to_input currently
mutates the passed input_value (e.g., message.content and part.text); instead
create and return a fresh ResponseInput: if input_value is a str, return a new
concatenated string; if it is a sequence, deep-copy the list and items (e.g.,
via copy.deepcopy or constructing new ResponseItem/ResponseMessage objects),
find the first user message (inspect item.type and item.role), modify the copy's
message.content or the input_text part.text on the copy, and return the copied
structure so the original input_value is not mutated.

---

Duplicate comments:
In `@docs/responses.md`:
- Line 12: TOC link target is out of sync with the renamed heading: update the
TOC entry that currently links to "#inherited-lls-openapi-fields" to the correct
anchor "#inherited-lls-openapi-attributes" (or revert the heading to "Inherited
LLS OpenAPI Fields") so the link matches the heading "Inherited LLS OpenAPI
Attributes"; locate the bullet "* [Inherited LLS OpenAPI
Fields](`#inherited-lls-openapi-fields`)" and change the anchor to
"#inherited-lls-openapi-attributes".

In `@src/app/endpoints/query.py`:
- Around line 212-217: The topic-summary generation currently runs for new
conversations unconditionally; change the guard so that you only call
get_topic_summary(query_request.query, client, responses_params.model) when this
is a new conversation (not user_conversation),
query_request.generate_topic_summary is true, AND the moderation result
indicates the prompt passed (i.e., moderation_result.decision == "passed");
update the if condition that surrounds get_topic_summary to include the
moderation_result.decision check so blocked/refused turns never trigger the
second LLM call.
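The gating described in this prompt can be sketched as a single predicate; `ModerationResult` and the flag names below are illustrative stand-ins, not the actual LCORE symbols:

```python
# A minimal sketch of the gated condition; names are hypothetical.
from dataclasses import dataclass


@dataclass
class ModerationResult:
    decision: str  # "passed" or "blocked"


def should_generate_topic_summary(
    is_new_conversation: bool,
    generate_topic_summary: bool,
    moderation_result: ModerationResult,
) -> bool:
    # Summarize only new conversations whose prompt passed moderation,
    # so blocked/refused turns never trigger the second LLM call.
    return (
        is_new_conversation
        and generate_topic_summary
        and moderation_result.decision == "passed"
    )
```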

In `@src/app/endpoints/responses.py`:
- Around line 692-695: When moderation_result.decision == "blocked" we should
skip generating a topic summary to avoid an unnecessary LLM call; update the
logic around request.generate_topic_summary and the call to get_topic_summary so
you only call get_topic_summary(input_text, client, api_params.model) when
request.generate_topic_summary is true AND moderation_result.decision !=
"blocked", and ensure topic_summary stays None for blocked responses (the
variable topic_summary and the logging call remain but the await
get_topic_summary must be gated by the moderation_result check).
- Around line 284-296: The topic-summary generator should be disabled for
shield-blocked requests: detect when moderation_result.decision == "blocked"
(where turn_summary.llm_response is set and shield_violation_generator is
created) and ensure generate_response is invoked with
generate_topic_summary=False (or add a guard inside generate_response to skip
summary generation when moderation_result indicates a block); update the call
site that currently passes through generate_topic_summary to respect this
blocked state so no topic-summary LLM call is made for shielded requests.

In `@src/app/endpoints/streaming_query.py`:
- Around line 254-258: The streaming path calls retrieve_response_generator but
only threads inline_rag_context.referenced_documents as inline_rag_docs, so
turn_summary.rag_chunks never gets inline_rag_context.rag_chunks and stored
streaming results lose pre-query RAG chunks; fix by ensuring the streaming call
carries the rag_chunks through—either pass inline_rag_context.rag_chunks into
retrieve_response_generator (add an inline_rag_chunks/rag_chunks param) or after
the call assign turn_summary.rag_chunks = inline_rag_context.rag_chunks so
turn_summary contains the same rag chunks as the non-streaming path (update the
call site around retrieve_response_generator and the handling of turn_summary to
include inline_rag_context.rag_chunks).
- Around line 300-315: The blocked moderation branch currently returns only the
generator and turn_summary (shield_violation_generator, turn_summary) so
generate_response treats it as success; change this return to include an
explicit failure signal (e.g., return (shield_violation_generator(...),
turn_summary, False) or a Failure enum) and update generate_response to check
that third value and skip success-only post-processing (topic summary, etc.)
when it indicates failure; adjust callers to destructure the new triple where
necessary.
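The explicit failure signal can be threaded through as a third tuple element; everything below is a hypothetical sketch, not the actual LCORE code:

```python
# Illustrative sketch of returning an explicit success flag alongside
# the stream; all names here are stand-ins.
from typing import Any, AsyncIterator


async def shield_violation_generator() -> AsyncIterator[str]:
    # Stand-in for the refusal stream emitted on a shield block.
    yield "I cannot answer that."


def retrieve_response_generator_sketch(
    blocked: bool,
) -> tuple[AsyncIterator[str], dict[str, Any], bool]:
    turn_summary = {"llm_response": "refused" if blocked else "ok"}
    # The third element is the success signal the caller checks before
    # running success-only post-processing (topic summary, etc.).
    return shield_violation_generator(), turn_summary, not blocked


generator, turn_summary, succeeded = retrieve_response_generator_sketch(blocked=True)
```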

In `@src/utils/shields.py`:
- Around line 255-256: The code currently treats an explicit empty override list
as a signal to disable moderation by returning [] when shield_ids == []; change
this to reject an explicit empty list with a 422 error instead. In the function
that validates/handles the shield_ids parameter (the block containing the check
"if shield_ids == []:"), replace the return [] with raising an HTTP 422 (e.g.,
raise HTTPException(status_code=422, detail="shield_ids cannot be an empty
list") or the project's equivalent) so callers cannot opt out of shields by
sending an empty list; keep None or missing values allowed if the rest of the
code expects them.

In `@src/utils/types.py`:
- Around line 258-279: The conversation popping logic currently sits inside the
tool rehydration block in model_dump; move it out and only strip "conversation"
for linear continuations (i.e., when there are no tools). Concretely, remove the
if self.previous_response_id: result.pop("conversation", None) from inside the
loop/after tool-rewrite, and instead add a guarded pop like if
self.previous_response_id and not self.tools: result.pop("conversation", None)
placed after you compute result and dumped_tools but before returning early for
tool handling or at the end of the function so it executes only for the resolved
tip/linear case; keep the existing tool rehydration (tools/dumped_tools loop)
unchanged.

In `@tests/unit/app/endpoints/test_streaming_query.py`:
- Around line 864-880: The test currently patches
append_turn_items_to_conversation but never asserts it was invoked; after
calling retrieve_response_generator with mock_context (whose
moderation_result.decision == "blocked"), add an assertion that the patched
append_turn_items_to_conversation AsyncMock was awaited (e.g.,
assert_called_once or assert_awaited_with) and verify it was passed the expected
elements such as mock_context (or its conversation) and
mock_context.moderation_result.refusal_response so the blocked-flow persistence
is actually tested alongside the existing TurnSummary and llm_response
assertions.

---

Nitpick comments:
In `@docs/responses.md`:
- Line 106: Update the docs to explicitly state that when the `shield_ids`
parameter is provided as an empty array it means "run no shields" (distinct from
omitting it which runs all configured shields), and mirror that clarification in
the LCORE-Specific Extensions -> Behavioral Differences section so both places
consistently document the empty-list behavior for `shield_ids`.
- Line 388: Update the `id` field description to clarify when it contains a
response ID versus a moderation ID: replace "Unique response ID or moderation
ID" with a precise statement such as "Unique identifier for the returned item:
the response ID for normal responses, or the moderation ID when the request was
moderated (string)"; ensure the `id` entry in docs/responses.md explicitly
states that the value varies based on moderation and that the type is always
string.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 92ec9903-9469-4377-947c-37c91c6bc0fd

📥 Commits

Reviewing files that changed from the base of the PR and between aa6c416 and 95e12bc.

📒 Files selected for processing (16)
  • docs/openapi.json
  • docs/responses.md
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/models/requests.py
  • src/models/responses.py
  • src/utils/conversations.py
  • src/utils/endpoints.py
  • src/utils/responses.py
  • src/utils/shields.py
  • src/utils/types.py
  • src/utils/vector_search.py
  • tests/unit/app/endpoints/test_conversations.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/unit/app/endpoints/test_conversations.py
  • src/utils/conversations.py
  • tests/unit/app/endpoints/test_responses.py
  • src/models/requests.py

@asimurka asimurka force-pushed the conversation_management branch from 95e12bc to bb1c1b7 on March 11, 2026 at 12:16
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (9)
src/utils/vector_search.py (1)

639-672: ⚠️ Potential issue | 🟠 Major

Return a fresh ResponseInput instead of mutating the caller's models.

This function directly mutates message.content (line 666) and part.text (line 670) on objects within the passed input_value list. As per coding guidelines: "Avoid in-place parameter modification anti-patterns; return new data structures instead."

♻️ Proposed fix using deep copy
+import copy
+
 def append_inline_rag_context_to_input(
     input_value: ResponseInput,
     inline_rag_context_text: str,
 ) -> ResponseInput:
     """Append inline RAG context to Responses API input.
 
     If input is str, appends the context text.
     If input is a sequence of items, appends the context to the text of the first user message.
     If there is no user message, returns the input unchanged.
 
     Parameters:
         input_value: The request input (string or list of ResponseItem).
         inline_rag_context_text: RAG context string to inject.
 
     Returns:
-        The same type as input_value, with context merged in.
+        A new ResponseInput with context merged in (original is not mutated).
     """
     if not inline_rag_context_text:
         return input_value
     if isinstance(input_value, str):
         return input_value + "\n\n" + inline_rag_context_text
+    
+    # Deep copy to avoid mutating the original
+    input_copy = copy.deepcopy(input_value)
-    for item in input_value:
+    for item in input_copy:
         if item.type != "message" or item.role != "user":
             continue
         message = cast(ResponseMessage, item)
         content = message.content
         if isinstance(content, str):
             message.content = content + "\n\n" + inline_rag_context_text
-            return input_value
+            return input_copy
         for part in content:
             if part.type == "input_text":
                 part.text = part.text + "\n\n" + inline_rag_context_text
-                return input_value
-    return input_value
+                return input_copy
+    return input_copy
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/vector_search.py` around lines 639 - 672, The function
append_inline_rag_context_to_input mutates the input_value in-place by modifying
message.content and part.text; instead, create and return a new ResponseInput
without mutating the caller's objects: detect whether input_value is a str and
return a new string with the context appended, or if it's a sequence, produce a
shallow copy of the list and replace the first matching ResponseItem
(type=="message" and role=="user") with a cloned ResponseMessage whose content
is also cloned (copy the string or clone the content parts) and with the
inline_rag_context_text appended to the cloned content (modify the cloned
part.text, not the original); ensure append_inline_rag_context_to_input returns
this new list (or new string) and leaves the original input_value,
ResponseMessage, and parts unchanged.
src/utils/shields.py (1)

255-256: ⚠️ Potential issue | 🟠 Major

Reject shield_ids=[] instead of treating it as "run none".

Returning an empty list when shield_ids == [] creates a moderation bypass. An explicit empty override should be rejected with HTTP 422, consistent with QueryRequest.shield_ids validation.

🛡️ Proposed fix
-    if shield_ids == []:
-        return []
+    if shield_ids == []:
+        response = UnprocessableEntityResponse(
+            response="Invalid shield_ids",
+            cause="shield_ids must be omitted or contain at least one shield identifier.",
+        )
+        raise HTTPException(**response.model_dump())

You'll also need to add UnprocessableEntityResponse to the imports from models.responses.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/shields.py` around lines 255 - 256, The function currently treats
an explicit shield_ids == [] as "run none" by returning an empty list; instead
detect this case and raise an UnprocessableEntityResponse (HTTP 422) so explicit
empty overrides are rejected; update the logic in the function that references
shield_ids to replace the early return with raising UnprocessableEntityResponse
(include a clear message like "shield_ids may not be an empty list"), and add
UnprocessableEntityResponse to the imports from models.responses so the
exception is available.
src/utils/endpoints.py (1)

319-321: ⚠️ Potential issue | 🟡 Minor

Fix remaining typo: "existance" → "existence" in log message.

The function was renamed correctly, but the log message still contains the typo.

✏️ Proposed fix
         logger.exception(
-            "Database error while checking turn existance for response_id %s",
+            "Database error while checking turn existence for response_id %s",
             response_id,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/endpoints.py` around lines 319 - 321, Update the log message passed
to logger.exception in the database check for turn existence (the
logger.exception call that currently uses "Database error while checking turn
existance for response_id %s") to correct the typo "existance" to "existence" so
the message reads "Database error while checking turn existence for response_id
%s"; keep the existing arguments (response_id) and call site intact.
tests/unit/app/endpoints/test_streaming_query.py (1)

867-877: ⚠️ Potential issue | 🟡 Minor

Assert the blocked-turn append payload here.

This still passes if append_turn_items_to_conversation() is awaited with the wrong conversation or refusal item. Please switch this to assert_awaited_once_with(...) and verify it uses mock_responses_params.conversation plus mock_moderation_result.refusal_response.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/app/endpoints/test_streaming_query.py` around lines 867 - 877, The
test currently doesn't assert the payload passed to
app.endpoints.streaming_query.append_turn_items_to_conversation; change the mock
assertion to use append_turn_items_to_conversation.assert_awaited_once_with(...)
and verify the first arg is mock_responses_params.conversation and the second
arg is mock_moderation_result.refusal_response (use the same mock names as in
the test where retrieve_response_generator is awaited) so the blocked-turn
append is validated explicitly.
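The stricter assertion looks like this with stdlib `unittest.mock`; the conversation id and refusal item below are placeholders for the actual fixtures (`mock_responses_params.conversation`, `mock_moderation_result.refusal_response`):

```python
# Sketch of asserting the exact awaited payload on an AsyncMock.
import asyncio
from unittest.mock import AsyncMock

append_turn_items_to_conversation = AsyncMock()


async def blocked_flow() -> None:
    # Stand-in for the code path exercised by the test.
    await append_turn_items_to_conversation("conv-123", "refusal-item")


asyncio.run(blocked_flow())

# Fails if the mock was never awaited, awaited more than once,
# or awaited with different arguments.
append_turn_items_to_conversation.assert_awaited_once_with("conv-123", "refusal-item")
```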
src/app/endpoints/query.py (1)

198-219: ⚠️ Potential issue | 🟠 Major

Blocked requests still fall into the topic-summary path.

Only the inline-RAG merge is gated. For a new conversation with generate_topic_summary=True, a blocked moderation result still reaches get_topic_summary() below and triggers another model call on refused content. The success-only post-processing needs to stay entirely behind the decision == "passed" branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/query.py` around lines 198 - 219, The topic-summary
generation is executed even when moderation_result.decision != "passed"; move
the entire new-conversation/topic-summary path into the success branch so it
only runs on passed content: inside the block guarded by
moderation_result.decision == "passed" (where inline_rag_context, turn_summary,
rag_chunks, and referenced_documents are handled), also perform the check for
not user_conversation and query_request.generate_topic_summary and call
get_topic_summary(query_request.query, client, responses_params.model) there;
otherwise set topic_summary = None outside that branch. Ensure you reference the
same symbols (moderation_result.decision, inline_rag_context, turn_summary,
get_topic_summary, query_request.generate_topic_summary) so the logic remains
fully gated by the moderation pass.
docs/responses.md (2)

22-27: ⚠️ Potential issue | 🟡 Minor

The TOC still points at the old anchor.

The section title is now ## Behavioral Differences, so the #known-limitations-and-behavioral-differences link no longer resolves.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/responses.md` around lines 22 - 27, The TOC entry still links to the old
anchor '#known-limitations-and-behavioral-differences' but the actual section
header was changed to '## Behavioral Differences'; update the TOC link to match
the new header (replace '#known-limitations-and-behavioral-differences' with
'#behavioral-differences') or rename the header back to the original; adjust the
list item that references "Known Limitations and Behavioral Differences" so its
anchor matches the 'Behavioral Differences' header.

105-107: ⚠️ Potential issue | 🟡 Minor

Document the empty shield_ids case explicitly.

This explains what happens when shield_ids is omitted, but not when callers send []. Please spell out that behavior here as well, otherwise clients cannot distinguish “disable shields” from “invalid request”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/responses.md` around lines 105 - 107, Update the `shield_ids`
documentation to explicitly describe both cases: when the field is omitted
(current behavior: apply all configured shields in LCORE) and when an empty
array `[]` is provided (explicitly state whether that disables all shields or is
treated as an invalid request), and include a short example or note showing the
exact client payload for each behavior so integrators can distinguish "omit" vs
"empty array" cases for the `shield_ids` parameter.
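The two payload shapes the docs should distinguish can be illustrated as plain dicts; these are hypothetical examples, and which behavior the empty list triggers (run no shields vs. reject with 422) is exactly the open question the docs need to answer:

```python
# Hypothetical request payloads only; field names follow docs/responses.md.
payload_shields_omitted = {
    "input": "example question",
    # "shield_ids" omitted -> all configured shields run
}

payload_shields_empty = {
    "input": "example question",
    "shield_ids": [],  # explicit empty list -> behavior must be documented
}
```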
src/app/endpoints/streaming_query.py (1)

300-316: ⚠️ Potential issue | 🟠 Major

A blocked stream is still being reported as a successful one.

This branch returns a normal generator, so generate_response() later treats the stream as completed and runs the success-only tail, including topic-summary generation for new conversations. Please mark blocked moderation as a non-success path instead of returning a generator that looks completed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/streaming_query.py` around lines 300 - 316, The branch
handling context.moderation_result.decision == "blocked" currently returns
shield_violation_generator which looks like a successful stream; instead raise a
specific non-success signal (e.g., raise
ModerationBlockedError(message=context.moderation_result.message,
moderation_id=context.moderation_result.moderation_id)) or return a clearly
non-success payload that generate_response recognizes, so the success-only tail
(topic-summary generation) is not executed. Update the blocked branch (where
turn_summary is set and append_turn_items_to_conversation is called) to raise or
return that failure indicator and ensure generate_response is updated/checked to
treat ModerationBlockedError (or the chosen failure payload) as a non-success
path rather than a completed generator.
src/app/endpoints/responses.py (1)

341-342: ⚠️ Potential issue | 🟠 Major

Still skip topic-summary generation for blocked turns.

Both flows can still call get_topic_summary() after moderation has already blocked the prompt. That adds another model call over refused content and can persist a summary for a turn that never produced a valid response.

Also applies to: 586-589, 693-695

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/responses.py` around lines 341 - 342, Guard topic-summary
generation behind the moderation result so we never call get_topic_summary or
persist summaries for prompts that were blocked/refused: before passing
generate_topic_summary or invoking get_topic_summary in the response-building
paths (e.g., where generate_topic_summary=request.generate_topic_summary or
where get_topic_summary(...) is called), add a check against the moderation
outcome (e.g., only proceed if moderation_result is not blocked/refused or if
is_blocked/is_refused is False). Apply this same guard in all three locations
referenced (around generate_topic_summary assignment and the calls at the other
two spots) so topic summaries are skipped when moderation has blocked the turn.
🧹 Nitpick comments (1)
src/app/endpoints/responses.py (1)

377-386: Prefer validated construction for the synthetic blocked responses.

ResponsesResponse now has several required echoed fields, and model_construct() will silently emit an incomplete SSE payload if echoed_params() ever drifts. Using normal validation here keeps the blocked streaming path in sync with the public schema.

Suggested change
-    created_response_object = ResponsesResponse.model_construct(
+    created_response_object = ResponsesResponse(
         id=moderation_result.moderation_id,
         created_at=int(created_at.timestamp()),
         status="in_progress",
         output=[],
         conversation=normalized_conv_id,
         available_quotas={},
         output_text="",
         **echoed_params,
     )
@@
-    completed_response_object = ResponsesResponse.model_construct(
+    completed_response_object = ResponsesResponse(
         id=moderation_result.moderation_id,
         created_at=int(created_at.timestamp()),
         completed_at=int(datetime.now(UTC).timestamp()),
         status="completed",
         output=[moderation_result.refusal_response],
         usage=get_zero_usage(),
         conversation=normalized_conv_id,
         available_quotas=available_quotas,
         output_text=moderation_result.message,
         **echoed_params,
     )

Also applies to: 417-428

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/responses.py` around lines 377 - 386, The synthetic
"blocked" response is being built with ResponsesResponse.model_construct(...),
which bypasses Pydantic validation and can produce incomplete SSE payloads if
echoed_params drifts; replace the model_construct usage with normal validated
construction by calling ResponsesResponse(...) so required echoed fields are
enforced, update both occurrences (the block around creation with
id=moderation_result.moderation_id, created_at=..., status="in_progress",
output=[], conversation=normalized_conv_id, available_quotas={}, output_text="",
**echoed_params and the similar block at the later occurrence) and ensure any
missing required keys in echoed_params are provided or handled so the
constructor doesn't raise unexpected exceptions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/app/endpoints/query.py`:
- Around line 285-291: The blocked-moderation path currently calls
append_turn_items_to_conversation(...) synchronously and a failure there can
return a 5xx instead of the moderation refusal; make this append best-effort by
wrapping the call in a try/except (or equivalent error handling) around the
append_turn_items_to_conversation invocation that references
responses_params.conversation, responses_params.input and
moderation_result.refusal_response, log any exception with context (including
conversation id and that the append failed) and then continue to return the
moderation refusal regardless of append outcome.
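The best-effort wrapper might look like the following; the function and parameter names mirror the review text but are illustrative, not the actual LCORE implementation:

```python
# Hedged sketch: persist the refusal turn, but never let a storage
# failure replace the moderation refusal with a 5xx.
import asyncio
import logging

logger = logging.getLogger(__name__)


async def persist_refusal_best_effort(append_fn, conversation_id, refusal_item) -> bool:
    try:
        await append_fn(conversation_id, refusal_item)
        return True
    except Exception:  # deliberately broad: the append is best-effort
        logger.exception(
            "Failed to append refusal turn to conversation %s", conversation_id
        )
        return False


async def failing_append(conversation_id, refusal_item):
    raise RuntimeError("database unavailable")


# The failure is swallowed; the caller still returns the moderation refusal.
ok = asyncio.run(persist_refusal_best_effort(failing_append, "conv-1", "refusal-item"))
```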

In `@src/app/endpoints/responses.py`:
- Around line 98-123: The 200 entry in responses_response only advertises JSON;
add a streaming SSE variant for stream=true by expanding the 200 response to
include a text/event-stream content type alongside application/json (either by
updating ResponsesResponse.openapi_response() to return both media types or
replacing the 200 mapping with an explicit dict that includes content:
{"application/json": {...}, "text/event-stream": {"schema": {"type":"string"},
"description":"SSE stream of events"}}) so the router.post("/responses",
responses=responses_response) OpenAPI map documents the SSE response.
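The expanded 200 entry could be shaped as below, following FastAPI's `responses=` mapping convention; the descriptions and schema are illustrative:

```python
# Sketch of a 200 entry advertising both JSON and SSE media types.
responses_response = {
    200: {
        "description": "Successful response (JSON body, or SSE when stream=true)",
        "content": {
            "application/json": {},
            "text/event-stream": {
                "schema": {"type": "string"},
                "description": "SSE stream of response events",
            },
        },
    },
}
```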

In `@src/utils/responses.py`:
- Around line 1217-1241: The function extract_attachments_text has a misspelled
parameter name response_inputnse_input; rename it to response_input in the
function signature (preserve the type ResponseInput) and update all internal
references (the isinstance check, the for loop, and any casts) to use
response_input so the docstring and code refer to the same identifier.

In `@tests/unit/app/endpoints/test_responses.py`:
- Around line 538-552: Replace the loose mock assertion so the test verifies the
exact refusal turn persisted: instead of mock_append.assert_called_once(), use
mock_append.assert_awaited_once_with(...) (or assert_awaited_once_with) and pass
the expected conversation ID, the original input text from responses_request,
and mock_moderation.refusal_response as the turn payload; keep the call to
responses_endpoint_handler and the final ResponsesResponse type check but assert
the precise arguments to append_turn_items_to_conversation to ensure the
blocked-flow persistence wiring is exercised.

---

Duplicate comments:
In `@docs/responses.md`:
- Around line 22-27: The TOC entry still links to the old anchor
'#known-limitations-and-behavioral-differences' but the actual section header
was changed to '## Behavioral Differences'; update the TOC link to match the new
header (replace '#known-limitations-and-behavioral-differences' with
'#behavioral-differences') or rename the header back to the original; adjust the
list item that references "Known Limitations and Behavioral Differences" so its
anchor matches the 'Behavioral Differences' header.
- Around line 105-107: Update the `shield_ids` documentation to explicitly
describe both cases: when the field is omitted (current behavior: apply all
configured shields in LCORE) and when an empty array `[]` is provided
(explicitly state whether that disables all shields or is treated as an invalid
request), and include a short example or note showing the exact client payload
for each behavior so integrators can distinguish "omit" vs "empty array" cases
for the `shield_ids` parameter.

In `@src/app/endpoints/query.py`:
- Around line 198-219: The topic-summary generation is executed even when
moderation_result.decision != "passed"; move the entire
new-conversation/topic-summary path into the success branch so it only runs on
passed content: inside the block guarded by moderation_result.decision ==
"passed" (where inline_rag_context, turn_summary, rag_chunks, and
referenced_documents are handled), also perform the check for not
user_conversation and query_request.generate_topic_summary and call
get_topic_summary(query_request.query, client, responses_params.model) there;
otherwise set topic_summary = None outside that branch. Ensure you reference the
same symbols (moderation_result.decision, inline_rag_context, turn_summary,
get_topic_summary, query_request.generate_topic_summary) so the logic remains
fully gated by the moderation pass.

In `@src/app/endpoints/responses.py`:
- Around line 341-342: Guard topic-summary generation behind the moderation
result so we never call get_topic_summary or persist summaries for prompts that
were blocked/refused: before passing generate_topic_summary or invoking
get_topic_summary in the response-building paths (e.g., where
generate_topic_summary=request.generate_topic_summary or where
get_topic_summary(...) is called), add a check against the moderation outcome
(e.g., only proceed if moderation_result is not blocked/refused or if
is_blocked/is_refused is False). Apply this same guard in all three locations
referenced (around generate_topic_summary assignment and the calls at the other
two spots) so topic summaries are skipped when moderation has blocked the turn.

In `@src/app/endpoints/streaming_query.py`:
- Around line 300-316: The branch handling context.moderation_result.decision ==
"blocked" currently returns shield_violation_generator which looks like a
successful stream; instead raise a specific non-success signal (e.g., raise
ModerationBlockedError(message=context.moderation_result.message,
moderation_id=context.moderation_result.moderation_id)) or return a clearly
non-success payload that generate_response recognizes, so the success-only tail
(topic-summary generation) is not executed. Update the blocked branch (where
turn_summary is set and append_turn_items_to_conversation is called) to raise or
return that failure indicator and ensure generate_response is updated/checked to
treat ModerationBlockedError (or the chosen failure payload) as a non-success
path rather than a completed generator.

In `@src/utils/endpoints.py`:
- Around line 319-321: Update the log message passed to logger.exception in the
database check for turn existence (the logger.exception call that currently uses
"Database error while checking turn existance for response_id %s") to correct
the typo "existance" to "existence" so the message reads "Database error while
checking turn existence for response_id %s"; keep the existing arguments
(response_id) and call site intact.

In `@src/utils/shields.py`:
- Around line 255-256: The function currently treats an explicit shield_ids ==
[] as "run none" by returning an empty list; instead detect this case and raise
an UnprocessableEntityResponse (HTTP 422) so explicit empty overrides are
rejected; update the logic in the function that references shield_ids to replace
the early return with raising UnprocessableEntityResponse (include a clear
message like "shield_ids may not be an empty list"), and add
UnprocessableEntityResponse to the imports from models.responses so the
exception is available.

In `@src/utils/vector_search.py`:
- Around line 639-672: The function append_inline_rag_context_to_input mutates
the input_value in-place by modifying message.content and part.text; instead,
create and return a new ResponseInput without mutating the caller's objects:
detect whether input_value is a str and return a new string with the context
appended, or if it's a sequence, produce a shallow copy of the list and replace
the first matching ResponseItem (type=="message" and role=="user") with a cloned
ResponseMessage whose content is also cloned (copy the string or clone the
content parts) and with the inline_rag_context_text appended to the cloned
content (modify the cloned part.text, not the original); ensure
append_inline_rag_context_to_input returns this new list (or new string) and
leaves the original input_value, ResponseMessage, and parts unchanged.

In `@tests/unit/app/endpoints/test_streaming_query.py`:
- Around line 867-877: The test currently doesn't assert the payload passed to
app.endpoints.streaming_query.append_turn_items_to_conversation; change the mock
assertion to use append_turn_items_to_conversation.assert_awaited_once_with(...)
and verify the first arg is mock_responses_params.conversation and the second
arg is mock_moderation_result.refusal_response (use the same mock names as in
the test where retrieve_response_generator is awaited) so the blocked-turn
append is validated explicitly.

---

Nitpick comments:
In `@src/app/endpoints/responses.py`:
- Around line 377-386: The synthetic "blocked" response is being built with
ResponsesResponse.model_construct(...), which bypasses Pydantic validation and
can produce incomplete SSE payloads if echoed_params drifts; replace the
model_construct usage with normal validated construction by calling
ResponsesResponse(...) so required echoed fields are enforced, update both
occurrences (the block around creation with id=moderation_result.moderation_id,
created_at=..., status="in_progress", output=[],
conversation=normalized_conv_id, available_quotas={}, output_text="",
**echoed_params and the similar block at the later occurrence) and ensure any
missing required keys in echoed_params are provided or handled so the
constructor doesn't raise unexpected exceptions.
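
The difference can be demonstrated with a small Pydantic v2 model (a stand-in, not the real ResponsesResponse):

```python
from pydantic import BaseModel, ValidationError

class ResponsesResponseSketch(BaseModel):  # illustrative stand-in
    id: str
    status: str
    output: list

# model_construct() skips validation: missing required fields go unnoticed
# until something downstream (e.g., SSE serialization) trips over them.
incomplete = ResponsesResponseSketch.model_construct(id="modr-1")
print(hasattr(incomplete, "status"))  # False: the field simply does not exist

# Normal construction fails fast on the same payload.
try:
    ResponsesResponseSketch(id="modr-1")
except ValidationError as exc:
    print(f"caught {len(exc.errors())} missing fields")
```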

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eded7ad2-8e9b-4821-84a2-6809241a4e7e

📥 Commits

Reviewing files that changed from the base of the PR and between 95e12bc and bb1c1b7.

📒 Files selected for processing (16)
  • docs/openapi.json
  • docs/responses.md
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/models/requests.py
  • src/models/responses.py
  • src/utils/conversations.py
  • src/utils/endpoints.py
  • src/utils/responses.py
  • src/utils/shields.py
  • src/utils/types.py
  • src/utils/vector_search.py
  • tests/unit/app/endpoints/test_conversations.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/utils/conversations.py
  • src/models/requests.py

@asimurka asimurka force-pushed the conversation_management branch 5 times, most recently from c16088b to 7142479 on March 13, 2026 07:48
llama_stack_conv_id,
)

# Use Conversations API to retrieve conversation items
Contributor Author

Extracted to get_all_conversation_items

len(items),
conversation_id,
)
# Retrieve turns metadata from database
Contributor Author

Extracted to retrieve_conversation_turns

stream: bool = False
temperature: Optional[float] = None
text: Optional[Text] = None
tool_choice: Optional[ToolChoice] = ToolChoiceMode.auto
Contributor Author

This is default in LLS

@asimurka asimurka requested review from are-ces and tisnik March 13, 2026 08:07
None, # provider specified as model prefix
request.state.authorized_actions,
)

Contributor Author

Conversation management logic extracted to utility function.

@tisnik

tisnik commented Mar 13, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Mar 13, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (7)
docs/responses.md (1)

106-107: ⚠️ Potential issue | 🟡 Minor

Clarify shield_ids=[] behavior explicitly.

Please document that an empty list means “run no shields” (while omitted means “use all configured shields”) to avoid client-side misinterpretation.

Also applies to: 523-524

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/responses.md` around lines 106 - 107, Update the API docs for the
`shield_ids` field to explicitly state the semantics of an empty list versus
omission: document that `shield_ids=[]` means "run no shields" while omitting
`shield_ids` (no key present) means "use all configured shields in LCORE"; apply
this clarification to both occurrences of the `shield_ids` row (the existing
table entry and the duplicate at lines referenced) so clients won't misinterpret
an empty array as "use defaults."
src/app/endpoints/streaming_query.py (1)

303-319: ⚠️ Potential issue | 🟠 Major

Blocked moderation path still looks like a successful stream path.

This branch returns a regular generator, so downstream completion logic can still execute success-only side effects (notably topic-summary generation) for blocked turns.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/streaming_query.py` around lines 303 - 319, The blocked
moderation branch currently returns a normal generator so downstream
success-only completion logic can still run; change this to return a stream that
signals failure immediately by returning an async generator which raises a
specific error (e.g., ModerationBlockedError) on first iteration instead of a
regular success stream. Update the code path that returns
shield_violation_generator(...) (involving context.moderation_result,
turn_summary, append_turn_items_to_conversation, MEDIA_TYPE_JSON) to return a
generator that raises when consumed (or replace shield_violation_generator with
a new async generator like moderation_blocked_stream that raises
ModerationBlockedError) so downstream completion/summary logic will detect the
failure and skip success-only side effects. Ensure the error type/message
includes context.moderation_result.message and preserve the current media_type
handling.
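
The raise-on-first-iteration stream can be sketched as a plain async generator (ModerationBlockedError is the illustrative error type suggested above, not an existing class):

```python
import asyncio

class ModerationBlockedError(Exception):
    """Raised when a moderation shield refuses the turn (name is illustrative)."""

async def moderation_blocked_stream(message: str):
    # Raising before the first yield makes the failure visible to whoever
    # consumes the stream, instead of looking like a clean end-of-stream.
    raise ModerationBlockedError(message)
    yield  # unreachable; its presence makes this an async generator

async def consume():
    try:
        async for _ in moderation_blocked_stream("refused by shield"):
            pass
    except ModerationBlockedError as exc:
        return str(exc)
    return "completed"

print(asyncio.run(consume()))
```

Downstream completion logic that consumes the stream now sees an exception and can skip success-only side effects such as topic-summary generation.
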
src/app/endpoints/query.py (1)

285-294: ⚠️ Potential issue | 🟠 Major

Don’t let refusal persistence failures override the moderation response.

The blocked path should return the refusal even if conversation-item persistence fails. Right now, a storage/API failure can surface as 5xx and hide the moderation outcome.

Suggested fix
     if moderation_result.decision == "blocked":
-        await append_turn_items_to_conversation(
-            client,
-            responses_params.conversation,
-            responses_params.input,
-            [moderation_result.refusal_response],
-        )
+        try:
+            await append_turn_items_to_conversation(
+                client,
+                responses_params.conversation,
+                responses_params.input,
+                [moderation_result.refusal_response],
+            )
+        except HTTPException as exc:
+            logger.warning(
+                "Failed to persist blocked moderation turn for conversation %s: %s",
+                responses_params.conversation,
+                exc.detail,
+            )
         return TurnSummary(
             id=moderation_result.moderation_id, llm_response=moderation_result.message
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/query.py` around lines 285 - 294, The blocked-moderation
branch currently calls append_turn_items_to_conversation and lets any
persistence/API exception bubble up, which can turn a moderation refusal into a
5xx error; wrap the call to append_turn_items_to_conversation in a try/except
(or try/except Exception) inside the moderation_result.decision == "blocked"
branch, log the exception (don't re-raise), and always return the TurnSummary
constructed from moderation_result (use moderation_result.moderation_id and
moderation_result.message) so the refusal is returned even if persistence fails.
src/utils/conversations.py (1)

476-478: ⚠️ Potential issue | 🟠 Major

Preserve upstream API status/details instead of always returning generic 500.

Both exception blocks currently collapse every APIStatusError into the same internal error. That masks actionable upstream statuses (e.g., 4xx) and makes failures harder to diagnose and handle correctly.

Suggested fix
-    except APIStatusError as e:
-        error_response = InternalServerErrorResponse.generic()
-        raise HTTPException(**error_response.model_dump()) from e
+    except APIStatusError as e:
+        status_code = (
+            getattr(e, "status_code", None)
+            or getattr(getattr(e, "response", None), "status_code", 500)
+        )
+        detail = getattr(getattr(e, "response", None), "text", None) or str(e)
+        raise HTTPException(status_code=status_code, detail=detail) from e

Also applies to: 512-514

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/conversations.py` around lines 476 - 478, Currently APIStatusError
is always converted to InternalServerErrorResponse.generic() before raising
HTTPException; change the handler in src/utils/conversations.py to preserve the
upstream status and details: in the except APIStatusError as e blocks extract
the upstream HTTP status (e.g., e.status_code or e.response.status_code), any
error body/message (e.g., e.response.json() or str(e)), and headers if present,
then raise HTTPException(status_code=<upstream status>, detail=<upstream
detail>, headers=<upstream headers>) instead of using
InternalServerErrorResponse.generic(); apply the same change to the other
APIStatusError handler around the 512-514 area so 4xx/5xx upstream info is
propagated rather than always returning a generic 500.
src/app/endpoints/responses.py (2)

586-589: ⚠️ Potential issue | 🟠 Major

Skip topic-summary generation when moderation blocks the turn.

Both streaming and non-streaming flows still call get_topic_summary() for blocked requests. That causes avoidable model calls and can persist summaries for refused content.

🛡️ Guard summary generation by moderation outcome
-    return StreamingResponse(
+    should_generate_topic_summary = (
+        (request.generate_topic_summary or False)
+        and moderation_result.decision == "passed"
+    )
+    return StreamingResponse(
         generate_response(
             generator=generator,
             turn_summary=turn_summary,
             client=client,
             auth=auth,
             input_text=input_text,
             started_at=started_at,
             api_params=api_params,
-            generate_topic_summary=request.generate_topic_summary or False,
+            generate_topic_summary=should_generate_topic_summary,
         ),
         media_type="text/event-stream",
     )
-    if request.generate_topic_summary:
+    if request.generate_topic_summary and moderation_result.decision == "passed":
         logger.debug("Generating topic summary for new conversation")
         topic_summary = await get_topic_summary(input_text, client, api_params.model)

Also applies to: 693-696

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/responses.py` around lines 586 - 589, The topic-summary
generation is being triggered even for moderation-blocked turns; before calling
get_topic_summary (in the generate_topic_summary branch around where
topic_summary is set), check the moderation outcome (e.g., moderation or
moderation_result) and skip calling get_topic_summary when the request is
blocked/refused; apply the same guard to the other occurrence (the second
generate_topic_summary block around lines 693-696) so summaries are not created
for moderated content.

98-100: ⚠️ Potential issue | 🟠 Major

Document the SSE 200 variant in OpenAPI.

/responses can return text/event-stream (when stream=true), but 200 currently advertises only JSON. Client generation and docs will be inaccurate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/app/endpoints/responses.py` around lines 98 - 100, The OpenAPI responses
map currently only registers the JSON 200 response via
ResponsesResponse.openapi_response() causing docs/clients to miss the SSE
variant; update the 200 entry in responses_response to include both the JSON and
the text/event-stream variants (e.g. add the streaming SSE response from the
streaming response helper, such as a ResponsesStreamResponse.openapi_response()
or by extending ResponsesResponse.openapi_response() to include
"text/event-stream" media type), ensuring the 200 response object lists the
"application/json" and "text/event-stream" content entries so the /responses
endpoint advertises the SSE (stream=true) variant.
src/models/requests.py (1)

737-743: ⚠️ Potential issue | 🟠 Major

previous_response_id validation blocks valid blocked-turn follow-ups.

Blocked responses are emitted with moderation IDs (src/app/endpoints/responses.py, Line 378 and Line 635). Rejecting IDs with "modr" prefix here breaks previous_response_id=response.id flows after moderation.

🔧 Suggested fix
-    @field_validator("previous_response_id")
-    @classmethod
-    def check_previous_response_id(cls, value: Optional[str]) -> Optional[str]:
-        """Validate that previous_response_id does not start with 'modr'."""
-        if value is not None and value.startswith("modr"):
-            raise ValueError("You cannot provide context by moderation response.")
-        return value
+    @field_validator("previous_response_id")
+    @classmethod
+    def check_previous_response_id(cls, value: Optional[str]) -> Optional[str]:
+        """Validate previous_response_id format."""
+        return value
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/requests.py` around lines 737 - 743, The current validator
check_previous_response_id rejects any previous_response_id starting with
"modr", which prevents legitimate follow-ups that reference moderation-generated
response IDs; update the validator on the previous_response_id field in the same
class so it no longer raises for values that start with "modr" (i.e., remove or
relax the startswith("modr") rejection) while preserving any other existing
validations and return the value unchanged. Ensure you only change the logic
inside check_previous_response_id and do not alter the field name or decorator
so other code referencing previous_response_id continues to work.
🧹 Nitpick comments (7)
tests/e2e/test_list.txt (1)

1-20: Consider alphabetizing the test list for easier maintenance.

The test list is not currently in alphabetical order. While not a blocker, maintaining alphabetical order would make it easier to navigate and prevent duplicate entries in the future.

📝 Alphabetically sorted version
 features/authorized_noop.feature
 features/authorized_noop_token.feature
 features/authorized_rh_identity.feature
+features/conversation_cache_v2.feature
+features/conversations.feature
 features/faiss.feature
+features/feedback.feature
+features/health.feature
+features/info.feature
+features/mcp.feature
+features/mcp_file_auth.feature
+features/models.feature
+features/query.feature
+features/rbac.feature
+features/responses.feature
+features/rest_api.feature
+features/rlsapi_v1.feature
+features/rlsapi_v1_errors.feature
 features/smoketests.feature
-features/authorized_noop.feature
-features/authorized_noop_token.feature
-features/authorized_rh_identity.feature
-features/rbac.feature
-features/conversations.feature
-features/conversation_cache_v2.feature
-features/feedback.feature
-features/health.feature
-features/info.feature
-features/responses.feature
-features/query.feature
-features/rlsapi_v1.feature
-features/rlsapi_v1_errors.feature
 features/streaming_query.feature
-features/rest_api.feature
-features/mcp.feature
-features/mcp_file_auth.feature
-features/models.feature
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_list.txt` around lines 1 - 20, The test list is unsorted; open
the test_list file and alphabetize the lines containing the feature file entries
(the lines like "features/faiss.feature", "features/smoketests.feature", etc.),
remove any exact duplicate entries, and save with a single trailing newline
preserved; ensure the ordering matches ASCII alphabetical order so entries like
"authorized_*" appear before "conversations.feature" and verify no other
non-feature lines are reordered or removed.
src/utils/suid.py (1)

108-113: Align this new utility docstring with Google-style sections.

Please add explicit Parameters and Returns sections for consistency with project docstring standards.

As per coding guidelines, "Follow Google Python docstring conventions ... with Parameters, Returns, Raises, and Attributes sections".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/suid.py` around lines 108 - 113, Update the docstring for the
is_moderation_id function to follow Google-style sections: add a Parameters
section documenting the suid: str argument and its meaning, and add a Returns
section describing the returned bool value (True when suid starts with 'modr',
False otherwise); keep the one-line summary and ensure types and descriptions
match existing project conventions in the is_moderation_id docstring.
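
The requested shape might look like this; the body mirrors the prefix check described above and should be treated as a sketch:

```python
def is_moderation_id(suid: str) -> bool:
    """Check whether a response ID was produced by the moderation path.

    Parameters:
        suid: The response identifier to inspect.

    Returns:
        True if the ID starts with the 'modr' prefix, False otherwise.
    """
    return suid.startswith("modr")
```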
docs/responses.md (1)

28-35: Add the new “Implicit Conversation Management” section to the TOC.

The section exists now, but it’s not listed in the table of contents.

Also applies to: 535-535

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/responses.md` around lines 28 - 35, TOC is missing the new "Implicit
Conversation Management" heading; add an entry linking to that section (use the
exact anchor "#implicit-conversation-management") into the existing list of
top-level items (near the entries like "Streaming Request" and "Error Handling")
so the table of contents includes "Implicit Conversation Management" with proper
nesting consistent with other entries.
src/models/responses.py (1)

1565-1576: Fail fast when JSON examples are missing in openapi_response().

The current fallback to {} can silently degrade generated docs. Consider raising SchemaError (same pattern as AbstractSuccessfulResponse) when the primary JSON example is absent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/responses.py` around lines 1565 - 1576, openapi_response currently
falls back to an empty JSON example when model_json_schema() has no "examples",
which should instead raise a SchemaError like AbstractSuccessfulResponse does;
update openapi_response to check json_example after computing
model_json_schema()/model_examples and if missing raise SchemaError with a clear
message (include cls.__name__ or the model identifier) rather than using an
empty dict, and keep the rest of the logic (schema_extra / sse_example / content
assembly) unchanged so SSE handling remains intact.
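
The fail-fast behavior can be sketched with a small helper (SchemaError and the function name are illustrative, mirroring the pattern described above):

```python
class SchemaError(Exception):
    """Raised when a model schema lacks the JSON example needed for the docs."""

def pick_json_example(schema: dict, model_name: str) -> dict:
    # Fail fast rather than silently emitting {} into the generated docs.
    examples = schema.get("examples") or []
    if not examples:
        raise SchemaError(f"{model_name}: model_json_schema() has no 'examples'")
    return examples[0]

print(pick_json_example({"examples": [{"id": "resp-1"}]}, "ResponsesResponse"))
```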
src/models/database/conversations.py (1)

34-35: Use optional typing for nullable columns.

nullable=True columns are currently typed as non-null Mapped[str], which is inconsistent with runtime values and weakens type safety.

Suggested typing fix
+from typing import Optional
@@
-    last_response_id: Mapped[str] = mapped_column(nullable=True)
+    last_response_id: Mapped[Optional[str]] = mapped_column(nullable=True)
@@
-    response_id: Mapped[str] = mapped_column(nullable=True, index=True)
+    response_id: Mapped[Optional[str]] = mapped_column(nullable=True, index=True)

As per coding guidelines, "Include complete type annotations for all class attributes".

Also applies to: 71-73

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/models/database/conversations.py` around lines 34 - 35, The nullable
columns in the Conversations model (e.g., last_response_id and the other
attributes marked nullable=True around lines 71-73) are typed as Mapped[str] but
must be optional; change their annotations to Mapped[Optional[str]] (or
Mapped[Optional[Type]] for non-str columns), import Optional from typing, and
update each attribute declaration that has nullable=True to use Optional in its
Mapped[...] type so the type hints match runtime nullability.
tests/unit/utils/test_endpoints.py (1)

456-716: Add one explicit mutual-exclusivity test for context strategies.

This suite is comprehensive; one extra case where both conversation_id and previous_response_id are provided would lock in the contract and prevent regressions around strategy precedence/validation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/utils/test_endpoints.py` around lines 456 - 716, Add a unit test
that calls resolve_response_context with both conversation_id and
previous_response_id set and asserts the function enforces mutual exclusivity by
raising an HTTPException (status_code 400); mock AsyncLlamaStackClientHolder as
in other tests, call endpoints.resolve_response_context(user_id="user-1",
others_allowed=False, conversation_id="conv-raw",
previous_response_id="resp-123", generate_topic_summary=None) and assert
exc_info.value.status_code == 400 and that the exception detail mentions both
"conversation_id" and "previous_response_id" to lock in the validation contract.
src/utils/endpoints.py (1)

255-260: Avoid double DB lookup for previous_response_id.

check_turn_existence() and retrieve_turn_by_response_id() query the same row. retrieve_turn_by_response_id() already handles missing rows with 404, so the pre-check is redundant.

♻️ Simplify to single lookup
-        if not check_turn_existence(previous_response_id):
-            error_response = NotFoundResponse(
-                resource="response", resource_id=previous_response_id
-            )
-            raise HTTPException(**error_response.model_dump())
-        prev_user_turn = retrieve_turn_by_response_id(previous_response_id)
+        prev_user_turn = retrieve_turn_by_response_id(previous_response_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/endpoints.py` around lines 255 - 260, Remove the redundant
pre-check: stop calling check_turn_existence(previous_response_id) before
retrieve_turn_by_response_id(previous_response_id) since
retrieve_turn_by_response_id already raises a 404; instead directly call
retrieve_turn_by_response_id(previous_response_id) and let it raise the
HTTPException/NotFoundResponse as needed, removing the NotFoundResponse
construction and raise around the check to avoid the double DB lookup and
duplicated logic (refer to functions check_turn_existence,
retrieve_turn_by_response_id and variable previous_response_id).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/app/endpoints/responses.py`:
- Line 12: ResponsesRequest.tool_choice is an OpenAIResponseInputToolChoiceMode
enum, but the code imports and compares against ToolChoice (from
llama_stack_api) so the check in the no_tools logic (where ToolChoice.none is
used) always fails; fix by importing and using the matching enum
OpenAIResponseInputToolChoiceMode (or replace the comparison with
OpenAIResponseInputToolChoiceMode.none) and update the import line that
currently brings in ToolChoice to instead import
OpenAIResponseInputToolChoiceMode (or add it) so the comparison in the no_tools
check and any other tool_choice comparisons use the same enum type.
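
The pitfall is that members of two distinct Enum classes never compare equal, even when their names and values match (both enums below are stand-ins for the real types):

```python
from enum import Enum

class ToolChoice(Enum):  # stand-in for the llama_stack_api enum
    none = "none"
    auto = "auto"

class OpenAIResponseInputToolChoiceMode(Enum):  # stand-in for the request enum
    none = "none"
    auto = "auto"

tool_choice = OpenAIResponseInputToolChoiceMode.none

print(tool_choice == ToolChoice.none)                         # False: different enum types
print(tool_choice == OpenAIResponseInputToolChoiceMode.none)  # True
```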

In `@src/models/database/conversations.py`:
- Around line 71-73: Make response_id unique so a previous_response_id maps to
exactly one turn: update the Conversation model's response_id mapped_column to
enforce uniqueness (e.g., add unique=True or add a UniqueConstraint on
response_id) while keeping nullable semantics, and add a corresponding DB
migration that handles existing duplicates/nulls (clean or dedupe rows before
applying the constraint). Target symbols: response_id (mapped_column on the
Conversation model) and any migration script you generate to alter the schema.
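
The dedupe-then-constrain sequence the migration needs can be exercised end to end with stdlib sqlite3 (table and column names are illustrative, not the real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE turns (id INTEGER PRIMARY KEY, response_id TEXT)")
con.executemany(
    "INSERT INTO turns (response_id) VALUES (?)",
    [("resp-1",), ("resp-1",), ("resp-2",), (None,)],
)
# Dedupe existing rows first so the constraint can be applied cleanly...
con.execute(
    "DELETE FROM turns WHERE id NOT IN "
    "(SELECT MIN(id) FROM turns GROUP BY response_id)"
)
# ...then enforce uniqueness going forward (NULLs stay allowed by the index).
con.execute("CREATE UNIQUE INDEX uq_turns_response_id ON turns (response_id)")
rows = [r[0] for r in con.execute("SELECT response_id FROM turns ORDER BY id")]
print(rows)
```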

In `@src/utils/query.py`:
- Around line 294-295: The code is persisting summary.id directly which allows
empty strings to be stored into UserTurn.response_id and propagate to
UserConversation.last_response_id; update the logic around where
response_id=summary.id is set (and the other occurrences noted) to first
normalize and guard: treat empty/whitespace summary.id as None/NULL (or skip
updating) and only write/update UserTurn.response_id and
UserConversation.last_response_id when summary.id is a non-empty valid string;
apply this check in the functions/methods that set response_id (locations
handling summary.id assignment and any update_last_response_id logic) so empty
values are never persisted.
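
The guard can be centralized in one small helper (the function name is illustrative), applied wherever summary.id flows into response_id or last_response_id:

```python
from typing import Optional

def normalize_response_id(raw: Optional[str]) -> Optional[str]:
    """Map empty or whitespace-only IDs to None so they are never persisted."""
    if raw is None or not raw.strip():
        return None
    return raw

# Only a non-empty ID should ever reach the response_id columns.
for candidate in ("resp-1", "", "   ", None):
    print(repr(normalize_response_id(candidate)))
```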

In `@tests/unit/app/endpoints/test_conversations.py`:
- Around line 588-592: The current assertions only check that
exc_info.value.status_code equals status.HTTP_500_INTERNAL_SERVER_ERROR and that
"response" is present in detail, which is too permissive; update the assertions
around exc_info and detail to validate the response shape/content more strictly
by asserting that detail["response"] is a dict (or expected type) and contains
expected keys/values (for example assert "message" or "error" in
detail["response"] and that detail["response"]["message"] contains the expected
error substring or code), and/or assert a specific substring in
str(detail["response"]) to ensure the failure is the one intended; make these
changes where exc_info, detail, and status.HTTP_500_INTERNAL_SERVER_ERROR are
used so regressions are caught.

In `@tests/unit/app/endpoints/test_responses.py`:
- Around line 764-767: The test uses a regular assert_called_once on an
AsyncMock (append_turn_items_to_conversation) which can pass even if not
awaited; change to the await-aware assertion (use
mock_append.assert_awaited_once()) and read its awaited call args via
mock_append.await_args (e.g., positional args = mock_append.await_args[0])
instead of mock_append.call_args; apply the same change for the other occurrence
of this AsyncMock in the file.
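
The pattern can be sketched with an AsyncMock (mock_append stands in for the real fixture):

```python
import asyncio
from unittest.mock import AsyncMock

mock_append = AsyncMock()  # stands in for append_turn_items_to_conversation

async def run():
    await mock_append("conv-1", [{"type": "refusal"}])

asyncio.run(run())

# assert_awaited_once() fails if the coroutine was only scheduled and never
# awaited, which plain assert_called_once() would miss.
mock_append.assert_awaited_once()
positional_args = mock_append.await_args[0]  # args of the awaited call
print(positional_args)
```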

In `@tests/unit/utils/test_shields.py`:
- Around line 619-745: The TestAppendTurnItemsRefusalCases class is a duplicate
that only calls get_shields_for_request instead of testing
append_turn_items_to_conversation; remove or replace it: either delete the
entire TestAppendTurnItemsRefusalCases class (and its methods) to avoid
redundant tests, or rewrite the class to actually exercise
append_turn_items_to_conversation by creating fixtures/mocks for a conversation
turn, a single output item marked as a refusal, and asserting the conversation
is updated accordingly; reference the class name TestAppendTurnItemsRefusalCases
and the function append_turn_items_to_conversation to locate where to change
tests, and remove usages of get_shields_for_request within that class.

---

Nitpick comments:
In `@docs/responses.md`:
- Around line 28-35: TOC is missing the new "Implicit Conversation Management"
heading; add an entry linking to that section (use the exact anchor
"#implicit-conversation-management") into the existing list of top-level items
(near the entries like "Streaming Request" and "Error Handling") so the table of
contents includes "Implicit Conversation Management" with proper nesting
consistent with other entries.

In `@src/models/database/conversations.py`:
- Around line 34-35: The nullable columns in the Conversations model (e.g.,
last_response_id and the other attributes marked nullable=True around lines
71-73) are typed as Mapped[str] but must be optional; change their annotations
to Mapped[Optional[str]] (or Mapped[Optional[Type]] for non-str columns), import
Optional from typing, and update each attribute declaration that has
nullable=True to use Optional in its Mapped[...] type so the type hints match
runtime nullability.

In `@src/models/responses.py`:
- Around line 1565-1576: openapi_response currently falls back to an empty JSON
example when model_json_schema() has no "examples", which should instead raise a
SchemaError like AbstractSuccessfulResponse does; update openapi_response to
check json_example after computing model_json_schema()/model_examples and if
missing raise SchemaError with a clear message (include cls.__name__ or the
model identifier) rather than using an empty dict, and keep the rest of the
logic (schema_extra / sse_example / content assembly) unchanged so SSE handling
remains intact.
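The fail-loudly behaviour being asked for can be sketched as below; `SchemaError` and the helper name are illustrative, and the real `openapi_response` logic around SSE examples is intentionally not reproduced.

```python
class SchemaError(Exception):
    """Raised when a response model defines no OpenAPI example."""


def first_json_example(schema: dict, model_name: str) -> dict:
    """Return the first example from a model's JSON schema, failing loudly.

    Instead of silently falling back to an empty example dict, raise
    SchemaError naming the offending model so the gap is caught early.
    """
    examples = schema.get("examples")
    if not examples:
        raise SchemaError(f"{model_name} has no examples in its JSON schema")
    return examples[0]
```

The error message includes the model name, so a missing example points straight at the model that needs fixing.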

In `@src/utils/endpoints.py`:
- Around line 255-260: Remove the redundant pre-check: stop calling
check_turn_existence(previous_response_id) before
retrieve_turn_by_response_id(previous_response_id) since
retrieve_turn_by_response_id already raises a 404; instead directly call
retrieve_turn_by_response_id(previous_response_id) and let it raise the
HTTPException/NotFoundResponse as needed, removing the NotFoundResponse
construction and raise around the check to avoid the double DB lookup and
duplicated logic (refer to functions check_turn_existence,
retrieve_turn_by_response_id and variable previous_response_id).
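The single-lookup shape can be sketched with stand-ins (the in-memory turn store, the `retrieve_turn_by_response_id` behaviour, and `HTTPException` below are illustrative stubs, not the project's real implementations):

```python
class HTTPException(Exception):
    """Stand-in for FastAPI's HTTPException."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def retrieve_turn_by_response_id(previous_response_id: str) -> dict:
    """Stub lookup; raises a 404 on a miss by itself."""
    turns = {"resp-1": {"id": "resp-1", "content": "hello"}}
    turn = turns.get(previous_response_id)
    if turn is None:
        raise HTTPException(404, f"response {previous_response_id} not found")
    return turn


def resolve_previous_turn(previous_response_id: str) -> dict:
    # After the fix there is a single lookup: no check_turn_existence()
    # pre-check, since the retrieval already raises 404 when the id is
    # unknown.
    return retrieve_turn_by_response_id(previous_response_id)
```

One call instead of two means one database round-trip and one place where the 404 is raised.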

In `@src/utils/suid.py`:
- Around line 108-113: Update the docstring for the is_moderation_id function to
follow Google-style sections: add a Parameters section documenting the suid: str
argument and its meaning, and add a Returns section describing the returned bool
value (True when suid starts with 'modr', False otherwise); keep the one-line
summary and ensure types and descriptions match existing project conventions in
the is_moderation_id docstring.
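A Google-style docstring for the helper might look like this sketch (the `Args`/`Returns` section names follow Google convention; the function body matches the behaviour the review describes):

```python
def is_moderation_id(suid: str) -> bool:
    """Check whether an identifier was generated by the moderation layer.

    Args:
        suid: The identifier to inspect.

    Returns:
        bool: True when ``suid`` starts with ``"modr"``, False otherwise.
    """
    return suid.startswith("modr")
```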

In `@tests/e2e/test_list.txt`:
- Around line 1-20: The test list is unsorted; open the test_list file and
alphabetize the lines containing the feature file entries (the lines like
"features/faiss.feature", "features/smoketests.feature", etc.), remove any exact
duplicate entries, and save with a single trailing newline preserved; ensure the
ordering matches ASCII alphabetical order so entries like "authorized_*" appear
before "conversations.feature" and verify no other non-feature lines are
reordered or removed.

In `@tests/unit/utils/test_endpoints.py`:
- Around line 456-716: Add a unit test that calls resolve_response_context with
both conversation_id and previous_response_id set and asserts the function
enforces mutual exclusivity by raising an HTTPException (status_code 400); mock
AsyncLlamaStackClientHolder as in other tests, call
endpoints.resolve_response_context(user_id="user-1", others_allowed=False,
conversation_id="conv-raw", previous_response_id="resp-123",
generate_topic_summary=None) and assert exc_info.value.status_code == 400 and
that the exception detail mentions both "conversation_id" and
"previous_response_id" to lock in the validation contract.
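A self-contained sketch of that test follows; `resolve_response_context` here is a stand-in that only models the mutual-exclusivity contract being locked in, not the real endpoint helper or its llama-stack client mocking.

```python
class HTTPException(Exception):
    """Stand-in for FastAPI's HTTPException."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def resolve_response_context(*, user_id, others_allowed, conversation_id,
                             previous_response_id, generate_topic_summary):
    """Stub mirroring the validation contract: the two context mechanisms
    are mutually exclusive."""
    if conversation_id and previous_response_id:
        raise HTTPException(
            400,
            "conversation_id and previous_response_id are mutually exclusive",
        )


def test_mutual_exclusivity():
    """Both context mechanisms set at once must yield a 400 whose detail
    names both fields."""
    try:
        resolve_response_context(
            user_id="user-1",
            others_allowed=False,
            conversation_id="conv-raw",
            previous_response_id="resp-123",
            generate_topic_summary=None,
        )
    except HTTPException as exc:
        assert exc.status_code == 400
        assert "conversation_id" in exc.detail
        assert "previous_response_id" in exc.detail
    else:
        raise AssertionError("expected HTTPException(400)")
```

In the real suite this would use `pytest.raises(HTTPException)` and the same `AsyncLlamaStackClientHolder` mocking as the neighbouring tests.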

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0c721055-72b9-4ba1-b1da-06563442cd90

📥 Commits

Reviewing files that changed from the base of the PR and between 95e12bc and 7142479.

📒 Files selected for processing (29)
  • docs/openapi.json
  • docs/responses.md
  • src/app/endpoints/conversations_v1.py
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/app/routers.py
  • src/models/database/conversations.py
  • src/models/requests.py
  • src/models/responses.py
  • src/utils/conversations.py
  • src/utils/endpoints.py
  • src/utils/query.py
  • src/utils/responses.py
  • src/utils/shields.py
  • src/utils/suid.py
  • src/utils/types.py
  • src/utils/vector_search.py
  • tests/e2e/features/responses.feature
  • tests/e2e/test_list.txt
  • tests/integration/test_openapi_json.py
  • tests/unit/app/endpoints/test_conversations.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/app/test_routers.py
  • tests/unit/utils/test_conversations.py
  • tests/unit/utils/test_endpoints.py
  • tests/unit/utils/test_query.py
  • tests/unit/utils/test_responses.py
  • tests/unit/utils/test_shields.py
💤 Files with no reviewable changes (1)
  • src/utils/shields.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/e2e/features/responses.feature
  • src/utils/vector_search.py
  • tests/unit/utils/test_query.py
  • tests/unit/utils/test_conversations.py

@tisnik tisnik (Contributor) left a comment

LGTM (but for the record, there is a high probability that some problems were not found)

@are-ces are-ces (Contributor) left a comment

Great work. Left a couple of nits; as we discussed, there is also a small bug to fix in the tool definition in the response.

return rag_chunks


def append_inline_rag_context_to_input(
Nit: I would change the name to indicate that this function is for the responses API

responses_request.solr,
)
if moderation_result.decision == "passed":
responses_request.input = append_inline_rag_context_to_input(
Should we also use this method for the query and streaming_query endpoints?

@asimurka (Contributor, Author) replied:

Yes, but it will require further refactoring, as queries append inline RAG chunks and attachments in the same place. This is, IMO, out of scope for this PR.

@asimurka asimurka force-pushed the conversation_management branch from 7142479 to 180dd49 Compare March 13, 2026 12:37