LCORE-1409: Refactor of shield moderation and inline RAG content persistence #1291
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough

Moderation is run early and its `ShieldModerationResult` is threaded through the query and streaming endpoints. If blocked, handlers append a refusal turn and return a moderation-driven `TurnSummary`; if passed, inline and tool RAG contexts are built/merged and normal response generation proceeds.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client/User
    participant Endpoint as Query/Streaming Endpoint
    participant Shield as ShieldService
    participant Vector as VectorSearch
    participant Tools as ToolRAG
    participant Responses as ResponsesAPI
    participant Conv as Conversations

    Client->>Endpoint: Send user query
    Endpoint->>Shield: run_shield_moderation(query)
    Shield-->>Endpoint: moderation_result {decision, message, ids}
    alt moderation_result.decision == "blocked"
        Endpoint->>Conv: append_turn_items_to_conversation(user_input, refusal)
        Endpoint-->>Client: Return TurnSummary (moderation message)
    else decision == "passed"
        Endpoint->>Vector: build_rag_context(moderation_decision="passed", query)
        Vector-->>Endpoint: inline_rag_context
        Endpoint->>Tools: fetch tool-based RAG (if tools configured)
        Tools-->>Endpoint: tool_rag_results
        Endpoint->>Responses: prepare_responses_params(inline_context.context_text, tools)
        Responses-->>Endpoint: LLM response + referenced_documents
        Endpoint->>Conv: append_turn_items_to_conversation(user_input, llm_output)
        Endpoint-->>Client: Return TurnSummary (id, response, merged referenced_documents)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ Passed checks (3 passed)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tests/unit/utils/test_vector_search.py (1)
455-500: ⚠️ Potential issue | 🟡 Minor
Add a regression test for the new `"blocked"` branch.

These updates only cover the `"passed"` path, but `build_rag_context()` now has a moderation short-circuit that skips inline RAG entirely. Please add a case that passes `"blocked"` and asserts the returned `RAGContext` is empty and `_fetch_byok_rag()`/`_fetch_solr_rag()` are not called.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/utils/test_vector_search.py` around lines 455 - 500, Add a new async test that calls build_rag_context(client_mock, "blocked", "any query", None) to cover the moderation short-circuit branch, mock the configuration to enable inline BYOK and inline Solr as in the other tests, and assert the returned RAGContext has empty context_text, rag_chunks, and referenced_documents; also assert that the helper methods _fetch_byok_rag and _fetch_solr_rag (mocked or spied on) were not called. Use the same test file tests/unit/utils/test_vector_search.py and name the test something like test_moderation_blocked_short_circuits_inline; reference build_rag_context, _fetch_byok_rag, and _fetch_solr_rag when locating the code to mock/spy.

src/app/endpoints/query.py (1)
198-221: ⚠️ Potential issue | 🔴 Critical
Do not generate topic summaries for blocked moderation results.

After `retrieve_response()` returns a blocked `TurnSummary`, this handler still falls through to the topic-summary branch for new conversations. That sends rejected input to another model call and weakens the shield.

🛡️ Minimal guard

```diff
-    # Get topic summary for new conversation
-    if not user_conversation and query_request.generate_topic_summary:
+    # Get topic summary only when the request actually reached the model
+    if (
+        moderation_result.decision == "passed"
+        and not user_conversation
+        and query_request.generate_topic_summary
+    ):
         logger.debug("Generating topic summary for new conversation")
         topic_summary = await get_topic_summary(
             query_request.query, client, responses_params.model
         )
     else:
         topic_summary = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/endpoints/query.py` around lines 198 - 221, The handler currently calls get_topic_summary for new conversations even when retrieve_response returned a blocked moderation_result; update the guard so topic summaries are only generated when the moderation result passed. Concretely, change the condition that uses user_conversation and query_request.generate_topic_summary to also require moderation_result.decision == "passed" (references: retrieve_response, moderation_result, query_request.generate_topic_summary, get_topic_summary, user_conversation) so blocked or rejected inputs do not trigger an extra model call — otherwise set topic_summary = None.

src/utils/shields.py (1)
131-156: ⚠️ Potential issue | 🟠 Major
Skip `models.list()` when no shields will run, and translate its failures too.

`shield_ids=[]` is documented as "skip all shields", but this path still makes a backend `client.models.list()` call before returning. That adds an unnecessary dependency for a no-op moderation request, and any `APIConnectionError`/`APIStatusError` from model discovery bypasses the new HTTPException mapping entirely.

🛠️ Minimal fix

```diff
     shields_to_run = await get_shields_for_request(client, shield_ids)
-    available_models = {model.id for model in await client.models.list()}
+    if not shields_to_run:
+        return ShieldModerationPassed()
+
+    try:
+        available_models = {model.id for model in await client.models.list()}
+    except APIConnectionError as e:
+        error_response = ServiceUnavailableResponse(
+            backend_name="Llama Stack",
+            cause=str(e),
+        )
+        raise HTTPException(**error_response.model_dump()) from e
+    except APIStatusError as e:
+        error_response = InternalServerErrorResponse.generic()
+        raise HTTPException(**error_response.model_dump()) from e
```

As per coding guidelines, `src/**/*.py`: Handle `APIConnectionError` from Llama Stack in API operations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/shields.py` around lines 131 - 156, The code calls client.models.list() unconditionally which triggers unnecessary backend calls and can raise APIConnectionError/APIStatusError before we know shields_to_run; change logic in get_shields_for_request / the loop where shields_to_run is computed so that you only call client.models.list() if shields_to_run is non-empty (i.e., after awaiting get_shields_for_request and confirming shields_to_run), and wrap the client.models.list() call in the same exception handling used for client.moderations.create (catch APIConnectionError and APIStatusError and translate them into ServiceUnavailableResponse / InternalServerErrorResponse HTTPExceptions) so failures from model discovery are mapped consistently; reference symbols: shields_to_run, get_shields_for_request, client.models.list, available_models, moderation_result, client.moderations.create.
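The two-part fix above (return early, then map model-discovery failures) can be sketched as follows. This is a hedged, self-contained illustration only: the exception classes, the `HTTPException` shim, the status codes, and the dict return value are all stand-ins for the project's real `ShieldModerationPassed` / `ServiceUnavailableResponse` / `InternalServerErrorResponse` types, not the actual implementation.

```python
# Stand-in exception/response types; the real ones live in the project.
class APIConnectionError(Exception):
    pass

class APIStatusError(Exception):
    pass

class HTTPException(Exception):
    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code

async def run_shield_moderation(client, shield_ids):
    """Skip the backend entirely when there are no shields to run."""
    shields_to_run = shield_ids or []
    if not shields_to_run:
        # shield_ids=[] is documented as "skip all shields": no model
        # discovery, no moderation call, just a passed result.
        return {"decision": "passed"}

    try:
        available_models = {m.id for m in await client.models.list()}
    except APIConnectionError as e:
        # Same 503 mapping the moderation call itself already uses.
        raise HTTPException(503, f"Llama Stack unavailable: {e}") from e
    except APIStatusError as e:
        raise HTTPException(500, "Internal server error") from e

    # ...continue: run each shield, checking against available_models.
    return {"decision": "passed", "models": available_models}
```

With this shape, a no-op moderation request never touches the backend, and a discovery failure surfaces as the same mapped HTTP error as a moderation failure.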
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 304-320: The branch handling context.moderation_result.decision ==
"blocked" must short-circuit post-stream processing instead of returning a
generator that the normal completion path will treat as a model stream; mark the
request as blocked (e.g., set a blocked flag on context or responses_params, or
set a field on turn_summary) before returning the shield_violation_generator,
and update the caller(s) such as generate_response and any post-processing
functions like get_topic_summary to check that blocked flag and skip
model-backed post-processing when true so blocked streams never flow through the
normal completion wrapper.
In `@src/utils/shields.py`:
- Around line 256-259: The f-string in the NotFoundResponse call is invalid due
to unescaped nested double quotes; update the resource argument in the
NotFoundResponse call (the line constructing resource=f"Shield{"s" if
len(missing) > 1 else ""}") to use a valid string expression—e.g., use single
quotes inside the expression or concatenate ("Shield" + ("s" if len(missing) > 1
else ""))—so the NotFoundResponse invocation compiles correctly; keep references
to NotFoundResponse and the missing variable unchanged.
In `@src/utils/types.py`:
- Line 330: build_turn_summary currently returns a TurnSummary without copying
response.id, so only blocked turns get the new TurnSummary.id while successful
turns remain empty; update src/utils/responses.py function build_turn_summary to
set TurnSummary.id = response.id (or pass id into the TurnSummary constructor)
whenever a response has an id, ensuring both blocked and non-blocked paths
populate the id field on the returned TurnSummary object.
---
Outside diff comments:
In `@src/app/endpoints/query.py`:
- Around line 198-221: The handler currently calls get_topic_summary for new
conversations even when retrieve_response returned a blocked moderation_result;
update the guard so topic summaries are only generated when the moderation
result passed. Concretely, change the condition that uses user_conversation and
query_request.generate_topic_summary to also require moderation_result.decision
== "passed" (references: retrieve_response, moderation_result,
query_request.generate_topic_summary, get_topic_summary, user_conversation) so
blocked or rejected inputs do not trigger an extra model call — otherwise set
topic_summary = None.
In `@src/utils/shields.py`:
- Around line 131-156: The code calls client.models.list() unconditionally which
triggers unnecessary backend calls and can raise
APIConnectionError/APIStatusError before we know shields_to_run; change logic in
get_shields_for_request / the loop where shields_to_run is computed so that you
only call client.models.list() if shields_to_run is non-empty (i.e., after
awaiting get_shields_for_request and confirming shields_to_run), and wrap the
client.models.list() call in the same exception handling used for
client.moderations.create (catch APIConnectionError and APIStatusError and
translate them into ServiceUnavailableResponse / InternalServerErrorResponse
HTTPExceptions) so failures from model discovery are mapped consistently;
reference symbols: shields_to_run, get_shields_for_request, client.models.list,
available_models, moderation_result, client.moderations.create.
In `@tests/unit/utils/test_vector_search.py`:
- Around line 455-500: Add a new async test that calls
build_rag_context(client_mock, "blocked", "any query", None) to cover the
moderation short-circuit branch, mock the configuration to enable inline BYOK
and inline Solr as in the other tests, and assert the returned RAGContext has
empty context_text, rag_chunks, and referenced_documents; also assert that the
helper methods _fetch_byok_rag and _fetch_solr_rag (mocked or spied on) were not
called. Use the same test file tests/unit/utils/test_vector_search.py and name
the test something like test_moderation_blocked_short_circuits_inline; reference
build_rag_context, _fetch_byok_rag, and _fetch_solr_rag when locating the code
to mock/spy.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f3253b00-6d5f-4a4e-abc6-36627d3ad421
📒 Files selected for processing (13)
- src/app/endpoints/query.py
- src/app/endpoints/streaming_query.py
- src/constants.py
- src/models/context.py
- src/utils/conversations.py
- src/utils/responses.py
- src/utils/shields.py
- src/utils/types.py
- src/utils/vector_search.py
- tests/unit/app/endpoints/test_query.py
- tests/unit/app/endpoints/test_streaming_query.py
- tests/unit/utils/test_shields.py
- tests/unit/utils/test_vector_search.py
36c2fa3 to b93f99c
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/utils/shields.py (1)
131-156: ⚠️ Potential issue | 🟠 Major
Wrap model discovery in error handling; add early return for empty shields.

Line 132 calls `await client.models.list()` outside the try/except block. If this API call fails with `APIConnectionError` or `APIStatusError`, the exception bubbles up unmapped, violating the error handling contract. Additionally, if `shields_to_run` is empty, the function still queries for available models unnecessarily.

Suggested fix

```diff
     shields_to_run = await get_shields_for_request(client, shield_ids)
+    if not shields_to_run:
+        return ShieldModerationPassed()
+
+    try:
+        available_models = {model.id for model in await client.models.list()}
+    except APIConnectionError as e:
+        error_response = ServiceUnavailableResponse(
+            backend_name="Llama Stack",
+            cause=str(e),
+        )
+        raise HTTPException(**error_response.model_dump()) from e
+    except APIStatusError as e:
+        error_response = InternalServerErrorResponse.generic()
+        raise HTTPException(**error_response.model_dump()) from e

     for shield in shields_to_run:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/shields.py` around lines 131 - 156, If get_shields_for_request returns no shields, return early before calling the models API; otherwise, move the await client.models.list() call into the existing try/except and wrap it to catch APIConnectionError and APIStatusError (same mapping used for client.moderations.create): on APIConnectionError raise HTTPException with ServiceUnavailableResponse(backend_name="Llama Stack", cause=str(e)); on APIStatusError raise HTTPException with InternalServerErrorResponse.generic(); update the code paths around shields_to_run, available_models and the loop that references shield.provider_resource_id (functions/methods involved: get_shields_for_request, client.models.list, client.moderations.create) so model discovery is protected by the same error handling and skipped when shields_to_run is empty.

src/app/endpoints/streaming_query.py (1)
293-296: ⚠️ Potential issue | 🟡 Minor
Stale docstring parameter.

The docstring mentions the `inline_rag_documents` parameter at line 296, but this parameter was removed from the function signature.

📝 Proposed fix

```diff
     Args:
         responses_params: The Responses API parameters
         context: The response generator context
-        inline_rag_documents: Referenced documents from inline RAG (BYOK + Solr)

     Returns:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/endpoints/streaming_query.py` around lines 293 - 296, The docstring for the function that accepts responses_params and context still mentions the removed parameter inline_rag_documents; update the docstring to reflect the current signature by removing any reference to inline_rag_documents (or replace it with the correct current parameter name if applicable) and ensure the Args section only lists actual parameters (e.g., responses_params, context) and their descriptions for the function in streaming_query.py.

src/app/endpoints/query.py (1)
214-221: ⚠️ Potential issue | 🟠 Major
Blocked content still triggers `get_topic_summary()`.

Same issue as in streaming_query.py - when moderation blocks, `get_topic_summary()` is still called for new conversations. This makes an LLM call with content that was already flagged by moderation.

🛡️ Proposed fix

```diff
     # Get topic summary for new conversation
-    if not user_conversation and query_request.generate_topic_summary:
+    if (
+        not user_conversation
+        and query_request.generate_topic_summary
+        and moderation_result.decision == "passed"
+    ):
         logger.debug("Generating topic summary for new conversation")
         topic_summary = await get_topic_summary(
             query_request.query, client, responses_params.model
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/endpoints/query.py` around lines 214 - 221, The topic summary is still generated for new conversations even when moderation blocked the content; modify the logic around get_topic_summary so it only runs when this is a new conversation (user_conversation is falsy), query_request.generate_topic_summary is true, AND the moderation step did not block the request—i.e., move or add a guard that checks the moderation result (use the same moderation flag/variable used elsewhere in this flow) before calling get_topic_summary(query_request.query, client, responses_params.model); ensure the variables user_conversation, query_request.generate_topic_summary and get_topic_summary are referenced to locate and update the condition accordingly.
♻️ Duplicate comments (2)
src/utils/shields.py (1)
256-259: ⚠️ Potential issue | 🔴 Critical
Fix the f-string before this module can import.

The nested double quotes in `resource=...` are invalid Python syntax, so `src/utils/shields.py` will fail to load and every shield path will break.

🐛 Minimal fix

```diff
-                resource=f"Shield{"s" if len(missing) > 1 else ""}",
+                resource=f"Shield{'s' if len(missing) > 1 else ''}",
```

This verification should currently fail with a `SyntaxError` at Line 257:

```bash
#!/bin/bash
python - <<'PY'
from pathlib import Path
import ast

path = Path("src/utils/shields.py")
source = path.read_text()
try:
    ast.parse(source)
    print("syntax ok")
except SyntaxError as exc:
    print(f"{path}:{exc.lineno}:{exc.offset}: {exc.msg}")
    if exc.text:
        print(exc.text.rstrip())
PY
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/shields.py` around lines 256 - 259, The f-string in the NotFoundResponse call is syntactically invalid due to nested double quotes; update the resource argument in the NotFoundResponse(...) call (the one constructing resource for pluralization) to use valid quoting or a small precomputed variable (e.g., compute plural = "s" if len(missing) > 1 else "" or use single quotes inside the f-string) so the expression like resource=f"Shield{'s' if len(missing) > 1 else ''}" is a valid Python expression that lets the module import; ensure you only change the resource argument near the NotFoundResponse instantiation and keep resource_id=", ".join(missing) intact.src/app/endpoints/streaming_query.py (1)
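For context: reusing the same quote character inside an f-string expression only parses on Python 3.12+ (PEP 701); on earlier interpreters it is a `SyntaxError` at import time, which is why the module fails to load. The precomputed-suffix variant from the prompt avoids the issue on every version. The function name below is a hypothetical wrapper added just for illustration:

```python
def shield_resource_label(missing: list[str]) -> str:
    # Precompute the suffix so the f-string needs no nested double quotes;
    # this parses on every supported Python version, unlike PEP 701 nesting.
    plural = "s" if len(missing) > 1 else ""
    return f"Shield{plural}"
```

Either this form or the single-quote-inside-double-quote form from the diff works; the precomputed variable is simply the most portable and readable of the two.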
304-320: ⚠️ Potential issue | 🟠 Major
Blocked streams still trigger `get_topic_summary()` for new conversations.

When moderation blocks, the `shield_violation_generator` is returned and passed to `generate_response()`. However, `generate_response()` still calls `get_topic_summary()` at lines 527-535 for new conversations regardless of moderation status. This:

- Makes an unnecessary LLM call for blocked content
- Could expose the model to the same content that moderation already blocked

Consider checking `context.moderation_result.decision` before calling `get_topic_summary()`:

🛡️ Proposed fix in generate_response()

```diff
     # Get topic summary for new conversations if needed
     topic_summary = None
-    if not context.query_request.conversation_id:
+    if not context.query_request.conversation_id and context.moderation_result.decision == "passed":
         should_generate = context.query_request.generate_topic_summary
         if should_generate:
             logger.debug("Generating topic summary for new conversation")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/endpoints/streaming_query.py` around lines 304 - 320, The code returns a shield_violation_generator when moderation blocks, but generate_response still calls get_topic_summary() for new conversations; update generate_response to check context.moderation_result.decision (or equivalent moderation flag) before invoking get_topic_summary() so that if decision == "blocked" you skip the topic-summary call and any LLM work for new conversations; locate the generate_response function and add a guard that short-circuits get_topic_summary() when moderation indicates "blocked" (keeping existing behavior for non-blocked flows and reusing shield_violation_generator/turn_summary handling).
🧹 Nitpick comments (1)
src/utils/conversations.py (1)
438-478: Avoid a second conversation-append implementation.

This repeats the same `client.conversations.items.create(...)` call and `APIConnectionError`/`APIStatusError` translation that already exists in `src/utils/shields.py`. Making the shield-specific helper delegate here, or extracting a shared low-level helper, would keep the two persistence paths from drifting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/conversations.py` around lines 438 - 478, append_turn_items_to_conversation duplicates the low-level persistence call and error translation around client.conversations.items.create (catching APIConnectionError/APIStatusError and raising ServiceUnavailableResponse/InternalServerErrorResponse); instead, remove the try/except and delegate to the shared helper that already implements this logic (the helper in shields.py that wraps client.conversations.items.create), or extract a new shared function like safe_create_conversation_items(client, conversation_id, items) and call that from append_turn_items_to_conversation; keep the same item construction (user_items/output_items) but replace the direct client.conversations.items.create call with a call to the shared helper so the APIConnectionError/APIStatusError translation (to ServiceUnavailableResponse/InternalServerErrorResponse) is centralized.
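The extraction this nitpick proposes might look like the sketch below. `safe_create_conversation_items` is the hypothetical shared helper name suggested in the prompt, not an existing function; the real version would also hold the `APIConnectionError`/`APIStatusError` to HTTPException translation so the mapping lives in exactly one place.

```python
async def safe_create_conversation_items(client, conversation_id, items):
    """Single choke point for conversation persistence.

    The real implementation would wrap this call in the shared
    APIConnectionError/APIStatusError -> HTTPException translation.
    """
    return await client.conversations.items.create(
        conversation_id=conversation_id, items=items
    )

async def append_turn_items_to_conversation(client, conversation_id,
                                            user_items, output_items):
    # Item construction stays local; persistence and error mapping are
    # delegated to the shared helper, so the two paths cannot drift.
    return await safe_create_conversation_items(
        client, conversation_id, user_items + output_items
    )
```

Both the shield refusal path and the normal turn path would then call the same helper, and any future change to the error mapping happens once.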
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/app/endpoints/query.py`:
- Around line 214-221: The topic summary is still generated for new
conversations even when moderation blocked the content; modify the logic around
get_topic_summary so it only runs when this is a new conversation
(user_conversation is falsy), query_request.generate_topic_summary is true, AND
the moderation step did not block the request—i.e., move or add a guard that
checks the moderation result (use the same moderation flag/variable used
elsewhere in this flow) before calling get_topic_summary(query_request.query,
client, responses_params.model); ensure the variables user_conversation,
query_request.generate_topic_summary and get_topic_summary are referenced to
locate and update the condition accordingly.
In `@src/app/endpoints/streaming_query.py`:
- Around line 293-296: The docstring for the function that accepts
responses_params and context still mentions the removed parameter
inline_rag_documents; update the docstring to reflect the current signature by
removing any reference to inline_rag_documents (or replace it with the correct
current parameter name if applicable) and ensure the Args section only lists
actual parameters (e.g., responses_params, context) and their descriptions for
the function in streaming_query.py.
In `@src/utils/shields.py`:
- Around line 131-156: If get_shields_for_request returns no shields, return
early before calling the models API; otherwise, move the await
client.models.list() call into the existing try/except and wrap it to catch
APIConnectionError and APIStatusError (same mapping used for
client.moderations.create): on APIConnectionError raise HTTPException with
ServiceUnavailableResponse(backend_name="Llama Stack", cause=str(e)); on
APIStatusError raise HTTPException with InternalServerErrorResponse.generic();
update the code paths around shields_to_run, available_models and the loop that
references shield.provider_resource_id (functions/methods involved:
get_shields_for_request, client.models.list, client.moderations.create) so model
discovery is protected by the same error handling and skipped when
shields_to_run is empty.
---
Duplicate comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 304-320: The code returns a shield_violation_generator when
moderation blocks, but generate_response still calls get_topic_summary() for new
conversations; update generate_response to check
context.moderation_result.decision (or equivalent moderation flag) before
invoking get_topic_summary() so that if decision == "blocked" you skip the
topic-summary call and any LLM work for new conversations; locate the
generate_response function and add a guard that short-circuits
get_topic_summary() when moderation indicates "blocked" (keeping existing
behavior for non-blocked flows and reusing
shield_violation_generator/turn_summary handling).
In `@src/utils/shields.py`:
- Around line 256-259: The f-string in the NotFoundResponse call is
syntactically invalid due to nested double quotes; update the resource argument
in the NotFoundResponse(...) call (the one constructing resource for
pluralization) to use valid quoting or a small precomputed variable (e.g.,
compute plural = "s" if len(missing) > 1 else "" or use single quotes inside the
f-string) so the expression like resource=f"Shield{'s' if len(missing) > 1 else
''}" is a valid Python expression that lets the module import; ensure you only
change the resource argument near the NotFoundResponse instantiation and keep
resource_id=", ".join(missing) intact.
---
Nitpick comments:
In `@src/utils/conversations.py`:
- Around line 438-478: append_turn_items_to_conversation duplicates the
low-level persistence call and error translation around
client.conversations.items.create (catching APIConnectionError/APIStatusError
and raising ServiceUnavailableResponse/InternalServerErrorResponse); instead,
remove the try/except and delegate to the shared helper that already implements
this logic (the helper in shields.py that wraps
client.conversations.items.create), or extract a new shared function like
safe_create_conversation_items(client, conversation_id, items) and call that
from append_turn_items_to_conversation; keep the same item construction
(user_items/output_items) but replace the direct
client.conversations.items.create call with a call to the shared helper so the
APIConnectionError/APIStatusError translation (to
ServiceUnavailableResponse/InternalServerErrorResponse) is centralized.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ddb91651-773b-4a11-b9e4-275db8e8c9c3
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (15)
- pyproject.toml
- src/app/endpoints/query.py
- src/app/endpoints/streaming_query.py
- src/constants.py
- src/models/context.py
- src/utils/conversations.py
- src/utils/responses.py
- src/utils/shields.py
- src/utils/types.py
- src/utils/vector_search.py
- tests/e2e/features/info.feature
- tests/unit/app/endpoints/test_query.py
- tests/unit/app/endpoints/test_streaming_query.py
- tests/unit/utils/test_shields.py
- tests/unit/utils/test_vector_search.py
✅ Files skipped from review due to trivial changes (1)
- pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (3)
- src/utils/types.py
- tests/unit/utils/test_vector_search.py
- src/models/context.py
b93f99c to 1261239
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/app/endpoints/streaming_query.py (1)
713-727: ⚠️ Potential issue | 🟡 Minor
`turn_summary.id` is not set in the streaming path.

In the non-streaming `query.py`, `build_turn_summary` sets `summary.id = response.id` from the response object. However, in this streaming path, `turn_summary.id` is never extracted from `latest_response_object`, leaving it as the default empty string. This creates an inconsistency where blocked requests have a proper ID (`moderation_id`) but successful streaming responses do not.

🔧 Proposed fix

```diff
         # Completed response - capture final text and response object
         elif event_type == "response.completed":
             latest_response_object = cast(
                 OpenAIResponseObject, getattr(chunk, "response")
             )
+            turn_summary.id = latest_response_object.id
             turn_summary.llm_response = turn_summary.llm_response or "".join(text_parts)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/endpoints/streaming_query.py` around lines 713 - 727, The streaming path doesn't set turn_summary.id, causing successful streamed responses to lack the response ID; update the "response.completed" branch to extract and assign the ID from latest_response_object (e.g., set turn_summary.id = getattr(latest_response_object, "id", turn_summary.id) or latest_response_object.id if present) before yielding the LLM_TURN_COMPLETE_EVENT so the turn_summary (used elsewhere like moderation_id comparisons) contains the same response.id as non-streaming flow (see symbols turn_summary and latest_response_object in the response.completed branch).
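The `getattr`-based fallback the prompt suggests can be exercised in isolation. In this sketch, `TurnSummary` and `capture_response_id` are simplified stand-ins for the real model in `src/utils/types.py` and the assignment inside the `response.completed` branch:

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class TurnSummary:
    id: str = ""
    llm_response: str = ""

def capture_response_id(turn_summary: TurnSummary, response_object) -> None:
    # Keep the existing value (e.g. a moderation_id on a blocked turn)
    # when the response object carries no id of its own.
    turn_summary.id = getattr(response_object, "id", turn_summary.id)
```

The fallback default is what makes the two paths consistent: streamed completions pick up `response.id`, while blocked turns keep the `moderation_id` they were seeded with.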
🧹 Nitpick comments (2)
tests/unit/app/endpoints/test_query.py (1)
550-567: Assert the blocked turn id as well.

This test now seeds `moderation_id`, but it only checks `llm_response`. Adding an assertion for `result.id` will catch regressions in the new `TurnSummary.id` field.

🧪 Suggested assertion

```diff
     assert isinstance(result, TurnSummary)
+    assert result.id == "mod_123"
     assert result.llm_response == "Content blocked by moderation"
     mock_append.assert_called_once()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/app/endpoints/test_query.py` around lines 550 - 567, The test seeds mock_moderation_result.moderation_id but never asserts the TurnSummary.id; update the test around the retrieve_response call (which returns a TurnSummary) to assert the blocked turn id is set by adding an assertion that result.id equals mock_moderation_result.moderation_id (or the literal "mod_123") in the same test where result.llm_response is checked and mock_append is asserted.

src/utils/shields.py (1)
13-13: Replace the private `openai._exceptions` import with the public API.

The private module `openai._exceptions` is brittle across SDK upgrades. Use the public import `from openai import APIStatusError` instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/shields.py` at line 13, Replace the private import from openai._exceptions with the public exported symbol: remove "from openai._exceptions import APIStatusError as OpenAIAPIStatusError" and import the public APIStatusError instead (use the name APIStatusError where OpenAIAPIStatusError is referenced in this module, e.g., in shields.py functions or exception handling). Ensure all references to OpenAIAPIStatusError are updated to the public APIStatusError identifier.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/utils/shields.py`:
- Around line 137-138: Short-circuit the no-shields path: after calling
get_shields_for_request (shields_to_run), if shields_to_run is empty immediately
return ShieldModerationPassed() instead of proceeding to call
client.models.list(); additionally wrap the client.models.list() call in a
try/except that catches the Llama Stack APIConnectionError (or equivalent) and
returns ShieldModerationPassed() or handles it gracefully so transient
network/errors while listing models don't surface as unmapped exceptions.
---
Outside diff comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 713-727: The streaming path doesn't set turn_summary.id, causing
successful streamed responses to lack the response ID; update the
"response.completed" branch to extract and assign the ID from
latest_response_object (e.g., set turn_summary.id =
getattr(latest_response_object, "id", turn_summary.id) or
latest_response_object.id if present) before yielding the
LLM_TURN_COMPLETE_EVENT so the turn_summary (used elsewhere like moderation_id
comparisons) contains the same response.id as non-streaming flow (see symbols
turn_summary and latest_response_object in the response.completed branch).
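A minimal sketch of that assignment (the `TurnSummary` and event object below are simplified stand-ins for the real streaming types):

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class TurnSummary:
    id: Optional[str] = None
    llm_response: str = ""


def handle_response_completed(turn_summary: TurnSummary, latest_response_object) -> TurnSummary:
    # Mirror the non-streaming flow: carry the backend response id into the
    # summary, keeping the existing id if the event object lacks one.
    turn_summary.id = getattr(latest_response_object, "id", turn_summary.id)
    return turn_summary


summary = handle_response_completed(TurnSummary(), SimpleNamespace(id="resp_42"))
print(summary.id)  # resp_42
unchanged = handle_response_completed(TurnSummary(id="keep"), object())
print(unchanged.id)  # keep
```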
---
Nitpick comments:
In `@src/utils/shields.py`:
- Line 13: Replace the private import from openai._exceptions with the public
exported symbol: remove "from openai._exceptions import APIStatusError as
OpenAIAPIStatusError" and import the public APIStatusError instead (use the name
APIStatusError where OpenAIAPIStatusError is referenced in this module, e.g., in
shields.py functions or exception handling). Ensure all references to
OpenAIAPIStatusError are updated to the public APIStatusError identifier.
In `@tests/unit/app/endpoints/test_query.py`:
- Around line 550-567: The test seeds mock_moderation_result.moderation_id but
never asserts the TurnSummary.id; update the test around the retrieve_response
call (which returns a TurnSummary) to assert the blocked turn id is set by
adding an assertion that result.id equals mock_moderation_result.moderation_id
(or the literal "mod_123") in the same test where result.llm_response is checked
and mock_append is asserted.
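The missing assertion could look like this (a self-contained sketch; the `TurnSummary` and helper below are simplified stand-ins for the real test fixtures):

```python
from dataclasses import dataclass


@dataclass
class TurnSummary:
    id: str
    llm_response: str


# Stand-in for retrieve_response() returning a blocked turn whose id is
# taken from the moderation result.
def retrieve_response_blocked(moderation_id: str, message: str) -> TurnSummary:
    return TurnSummary(id=moderation_id, llm_response=message)


result = retrieve_response_blocked("mod_123", "Content blocked")

# The assertion the review asks for, alongside the existing ones:
assert result.id == "mod_123"
assert result.llm_response == "Content blocked"
print("ok")
```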
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e7fae02e-26bf-47cd-bc85-66f8139fb4f8
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (15)
- pyproject.toml
- src/app/endpoints/query.py
- src/app/endpoints/streaming_query.py
- src/constants.py
- src/models/context.py
- src/utils/conversations.py
- src/utils/responses.py
- src/utils/shields.py
- src/utils/types.py
- src/utils/vector_search.py
- tests/e2e/features/info.feature
- tests/unit/app/endpoints/test_query.py
- tests/unit/app/endpoints/test_streaming_query.py
- tests/unit/utils/test_shields.py
- tests/unit/utils/test_vector_search.py
🚧 Files skipped from review as they are similar to previous changes (5)
- tests/e2e/features/info.feature
- pyproject.toml
- src/utils/responses.py
- src/models/context.py
- tests/unit/utils/test_vector_search.py
    Returns:
        RAGContext containing formatted context text and referenced documents
    """
    if moderation_decision == "blocked":

Don't do inline RAG if input was already marked as blocked
    # Moderation input is the raw user content (query + attachments) without injected RAG
    # context, to avoid false positives from retrieved document content.
    moderation_input = prepare_input(query_request)
    moderation_result = await run_shield_moderation(

Do moderation before the inline RAG
    stream=False,
    store=True,
    request_headers=request.headers,
    inline_rag_context=inline_rag_context.context_text or None,

Model uses default factory list
    client = await update_azure_token(client)

    # Build index identification mapping for RAG source resolution
    vector_store_ids = extract_vector_store_ids_from_tools(responses_params.tools)

Do this inside retrieve_response instead of passing as arguments
    )
    turn_summary = await retrieve_response(client, responses_params, moderation_result)

    if moderation_result.decision == "passed":

Only append RAG chunks if moderation passed
    error_response = handle_known_apistatus_errors(e, responses_params.model)
    raise HTTPException(**error_response.model_dump()) from e

    vector_store_ids = extract_vector_store_ids_from_tools(responses_params.tools)

Replaced from outer scope
    )

    # Combine inline RAG results (BYOK + Solr) with tool-based results
    if context.moderation_result.decision == "passed":

Combine tool-based and inline referenced docs here
    moderation_result = await client.moderations.create(
        input=input_text, model=shield.provider_resource_id
    )
    # Known Llama Stack bug: error is raised when violation is present

    # Reject if no requested shields were found (prevents accidental bypass)
    if not shields_to_run:
        response = UnprocessableEntityResponse(

Use 404 instead of 422
1261239 to dac0275 (compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/app/endpoints/query.py (1)
162-186: ⚠️ Potential issue | 🟠 Major: Blocked requests still pay the full response-prep cost.

After moderation blocks the input, the handler still goes through `prepare_responses_params()`. That can list models, resolve vector stores, and build tools before `retrieve_response()` returns the refusal. A blocked request can therefore still do the expensive/network-dependent work this refactor is trying to skip, and may fail with an unrelated 5xx instead of returning the moderation message.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/endpoints/query.py` around lines 162 - 186, After run_shield_moderation returns, short-circuit blocked requests immediately instead of continuing to build RAG context or call prepare_responses_params; inspect moderation_result.decision (or equivalent "blocked"/"refuse" state) and if it indicates a block, return the moderation refusal response to the caller (or call the existing path that yields the refusal) without invoking build_rag_context or prepare_responses_params so you avoid listing models, resolving vector stores, or initializing tools for blocked inputs. Ensure the decision check happens right after run_shield_moderation and before any expensive calls (build_rag_context, prepare_responses_params, retrieve_response).

tests/unit/utils/test_vector_search.py (1)
455-500: ⚠️ Potential issue | 🟡 Minor: Add coverage for the new blocked branch.

The updated tests only exercise `moderation_decision="passed"`. Please add a `"blocked"` case that asserts `build_rag_context()` returns an empty `RAGContext` and never calls the BYOK/Solr fetchers; otherwise the main optimization from this refactor can regress unnoticed.

🧪 Suggested test shape
```python
@pytest.mark.asyncio
async def test_blocked_skips_all_rag_fetches(self, mocker) -> None:  # type: ignore[no-untyped-def]
    client_mock = mocker.AsyncMock()
    fetch_byok = mocker.patch(
        "utils.vector_search._fetch_byok_rag",
        new=mocker.AsyncMock(),
    )
    fetch_solr = mocker.patch(
        "utils.vector_search._fetch_solr_rag",
        new=mocker.AsyncMock(),
    )

    context = await build_rag_context(client_mock, "blocked", "test query", None)

    assert context.context_text == ""
    assert context.rag_chunks == []
    assert context.referenced_documents == []
    fetch_byok.assert_not_called()
    fetch_solr.assert_not_called()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/utils/test_vector_search.py` around lines 455 - 500, Add a test exercising moderation_decision="blocked" for build_rag_context: create a config_mock like in test_byok_enabled_only (inline BYOK configured and inline_solr_enabled False), create an AsyncMock client_mock and patch any Solr/BYOK fetcher or client methods (e.g., client_mock.vector_io.query) to be spies, then call await build_rag_context(client_mock, "blocked", "test query", None) and assert the returned RAGContext has empty context_text, rag_chunks, and referenced_documents, and assert that client_mock.vector_io.query (and any Solr fetcher mocks) were never awaited/called to ensure the blocked branch short-circuits the fetchers.
♻️ Duplicate comments (1)
src/utils/shields.py (1)
137-138: ⚠️ Potential issue | 🟠 Major: Return early when no shields will run.

`get_shields_for_request()` can legitimately return `[]` (no configured shields, or `shield_ids=[]`), but Line 138 still calls `client.models.list()` without any error mapping. That adds avoidable network I/O to the no-op path and can turn a request that should return `ShieldModerationPassed()` into an unrelated backend failure.

🛠️ Suggested fix

```diff
     shields_to_run = await get_shields_for_request(client, shield_ids)
-    available_models = {model.id for model in await client.models.list()}
+    if not shields_to_run:
+        return ShieldModerationPassed()
+
+    try:
+        available_models = {model.id for model in await client.models.list()}
+    except APIConnectionError as e:
+        error_response = ServiceUnavailableResponse(
+            backend_name="Llama Stack",
+            cause=str(e),
+        )
+        raise HTTPException(**error_response.model_dump()) from e
+    except LLSApiStatusError as e:
+        error_response = InternalServerErrorResponse.generic()
+        raise HTTPException(**error_response.model_dump()) from e
+
     for shield in shields_to_run:
```

As per coding guidelines: Handle `APIConnectionError` from Llama Stack in API operations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/shields.py` around lines 137 - 138, After calling get_shields_for_request(client, shield_ids), check if shields_to_run is empty and return the no-op result (e.g., ShieldModerationPassed()) immediately to avoid calling client.models.list(); if shields exist, then call client.models.list() to build available_models. Also wrap the client.models.list() call in a try/except that catches APIConnectionError (from the Llama stack) and maps/raises it to the module's expected error type instead of letting it surface as an unrelated backend failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/utils/shields.py`:
- Around line 256-270: The helper currently calls client.shields.list() even
when shield_ids is an explicit empty list (meaning "skip all shields"); modify
the logic in the function handling shield resolution (the code block that uses
configured_shields, requested, configured_ids, missing) to check for shield_ids
== [] and immediately return an empty list before calling client.shields.list();
ensure this check preserves the existing behavior for shield_ids is None (i.e.,
still fetch configured_shields via client.shields.list()), and keep the rest of
the missing-ID validation (using requested, configured_ids, NotFoundResponse,
and HTTPException) unchanged.
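A runnable sketch of those resolution semantics (the stub client is illustrative, and the real helper raises an HTTPException built from NotFoundResponse rather than the LookupError used here):

```python
import asyncio


class Shield:
    def __init__(self, identifier):
        self.identifier = identifier


async def get_shields_for_request(client, shield_ids):
    """Sketch: [] means 'run no shields' and must never hit the backend;
    None means 'use all configured shields'."""
    if shield_ids == []:
        return []  # explicit opt-out: skip client.shields.list() entirely
    configured = await client.list_shields()
    if shield_ids is None:
        return configured
    configured_ids = {s.identifier for s in configured}
    missing = [sid for sid in shield_ids if sid not in configured_ids]
    if missing:
        raise LookupError(f"Unknown shield ids: {missing}")
    return [s for s in configured if s.identifier in shield_ids]


class Client:
    calls = 0

    async def list_shields(self):
        Client.calls += 1
        return [Shield("llama-guard")]


c = Client()
print(asyncio.run(get_shields_for_request(c, [])))   # []
print(Client.calls)                                  # 0 -> backend never called
print([s.identifier for s in asyncio.run(get_shields_for_request(c, None))])  # ['llama-guard']
```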
---
Outside diff comments:
In `@src/app/endpoints/query.py`:
- Around line 162-186: After run_shield_moderation returns, short-circuit
blocked requests immediately instead of continuing to build RAG context or call
prepare_responses_params; inspect moderation_result.decision (or equivalent
"blocked"/"refuse" state) and if it indicates a block, return the moderation
refusal response to the caller (or call the existing path that yields the
refusal) without invoking build_rag_context or prepare_responses_params so you
avoid listing models, resolving vector stores, or initializing tools for blocked
inputs. Ensure the decision check happens right after run_shield_moderation and
before any expensive calls (build_rag_context, prepare_responses_params,
retrieve_response).
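The ordering this prompt asks for can be sketched with plain callables (all names below are illustrative stand-ins, not the endpoint's real signatures):

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ModerationResult:
    decision: str
    message: str = ""


async def handle_query(query: str, run_moderation, prepare_params, respond):
    """Sketch: decide on moderation before any expensive response prep."""
    moderation = await run_moderation(query)
    if moderation.decision == "blocked":
        # Return the refusal immediately; never list models or build tools.
        return moderation.message
    params = await prepare_params(query)
    return await respond(params)


calls = []


async def block_all(_q):
    return ModerationResult("blocked", "I cannot help with that.")


async def expensive_prepare(_q):
    calls.append("prepare")
    return {}


async def respond(_p):
    calls.append("respond")
    return "answer"


out = asyncio.run(handle_query("bad input", block_all, expensive_prepare, respond))
print(out)    # I cannot help with that.
print(calls)  # [] -> no expensive work was done for the blocked input
```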
In `@tests/unit/utils/test_vector_search.py`:
- Around line 455-500: Add a test exercising moderation_decision="blocked" for
build_rag_context: create a config_mock like in test_byok_enabled_only (inline
BYOK configured and inline_solr_enabled False), create an AsyncMock client_mock
and patch any Solr/BYOK fetcher or client methods (e.g.,
client_mock.vector_io.query) to be spies, then call await
build_rag_context(client_mock, "blocked", "test query", None) and assert the
returned RAGContext has empty context_text, rag_chunks, and
referenced_documents, and assert that client_mock.vector_io.query (and any Solr
fetcher mocks) were never awaited/called to ensure the blocked branch
short-circuits the fetchers.
---
Duplicate comments:
In `@src/utils/shields.py`:
- Around line 137-138: After calling get_shields_for_request(client,
shield_ids), check if shields_to_run is empty and return the no-op result (e.g.,
ShieldModerationPassed()) immediately to avoid calling client.models.list(); if
shields exist, then call client.models.list() to build available_models. Also
wrap the client.models.list() call in a try/except that catches
APIConnectionError (from the Llama stack) and maps/raises it to the module's
expected error type instead of letting it surface as an unrelated backend
failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c9ad4e7d-5ff5-483f-a424-805c5aa04f74
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (15)
- pyproject.toml
- src/app/endpoints/query.py
- src/app/endpoints/streaming_query.py
- src/constants.py
- src/models/context.py
- src/utils/conversations.py
- src/utils/responses.py
- src/utils/shields.py
- src/utils/types.py
- src/utils/vector_search.py
- tests/e2e/features/info.feature
- tests/unit/app/endpoints/test_query.py
- tests/unit/app/endpoints/test_streaming_query.py
- tests/unit/utils/test_shields.py
- tests/unit/utils/test_vector_search.py
🚧 Files skipped from review as they are similar to previous changes (7)
- tests/e2e/features/info.feature
- src/utils/types.py
- src/constants.py
- pyproject.toml
- src/models/context.py
- src/app/endpoints/streaming_query.py
- src/utils/conversations.py
c727d41 to 047e955 (compare)
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/app/endpoints/streaming_query.py (1)
305-320: ⚠️ Potential issue | 🟠 Major: Returning the refusal generator here still triggers the normal success footer.

Lines 305-320 hand back a plain async iterator, so `generate_response()` later treats a blocked request as a successful completion and still runs topic-summary generation plus quota/token bookkeeping. That reintroduces model-backed work after moderation already rejected the input, and footer failures can still tear down a refused stream. Please carry an explicit blocked flag, or branch the wrapper, so the success-only footer is skipped for this path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/app/endpoints/streaming_query.py` around lines 305 - 320, The refusal path currently returns only (shield_violation_generator(...), turn_summary) which causes generate_response() to treat it as a normal success and run the success-only footer and bookkeeping; change this return to include an explicit blocked flag or different wrapper so generate_response can detect it and skip the success footer. Concretely, update the block in streaming_query.py that uses context.moderation_result to return a 3-tuple (response_iterator, turn_summary, blocked=True) or a small ResponseWrapper object, and then modify generate_response (the consumer of that return) to check the blocked flag or wrapper type and branch to skip topic-summary generation and quota/token bookkeeping when blocked is true; keep existing calls like append_turn_items_to_conversation and shield_violation_generator unchanged.

src/utils/shields.py (1)
137-149: ⚠️ Potential issue | 🟠 Major: Don't call `models.list()` on paths that never need a shield model lookup.

After Lines 137-149, requests with `shield_ids=[]`, no configured shields, or only non-`llama-guard` shields still make a Llama Stack `models.list()` call. That adds avoidable latency and can surface a 503/500 before moderation even starts, even though those paths should just return `ShieldModerationPassed()`.

🛠️ Suggested fix

```diff
     shields_to_run = await get_shields_for_request(client, shield_ids)
-    available_models = {model.id for model in await client.models.list()}
+    if not shields_to_run:
+        return ShieldModerationPassed()
+
+    available_models: Optional[set[str]] = None
     for shield in shields_to_run:
-        if shield.provider_id == "llama-guard" and (
-            not shield.provider_resource_id
-            or shield.provider_resource_id not in available_models
-        ):
+        if shield.provider_id == "llama-guard":
+            if available_models is None:
+                try:
+                    available_models = {model.id for model in await client.models.list()}
+                except APIConnectionError as e:
+                    error_response = ServiceUnavailableResponse(
+                        backend_name="Llama Stack",
+                        cause=str(e),
+                    )
+                    raise HTTPException(**error_response.model_dump()) from e
+                except LLSApiStatusError as e:
+                    error_response = InternalServerErrorResponse.generic()
+                    raise HTTPException(**error_response.model_dump()) from e
+
+            if (
+                not shield.provider_resource_id
+                or shield.provider_resource_id not in available_models
+            ):
                 logger.error("Shield model not found: %s", shield.provider_resource_id)
                 response = NotFoundResponse(
                     resource="Shield model", resource_id=shield.provider_resource_id or ""
                 )
                 raise HTTPException(**response.model_dump())
```

Run this to confirm the unconditional `models.list()` call is still on the fast paths:

```bash
#!/bin/bash
set -euo pipefail
python - <<'PY'
from pathlib import Path
path = Path("src/utils/shields.py")
lines = path.read_text().splitlines()
for start, end in [(114, 181), (237, 281)]:
    print(f"\n--- {path}:{start}-{end} ---")
    for i in range(start - 1, end):
        print(f"{i + 1}: {lines[i]}")
PY
```

Expected result: `run_shield_moderation()` calls `client.models.list()` immediately after `get_shields_for_request()` and before any `if not shields_to_run` or lazy `provider_id == "llama-guard"` guard.

Based on learnings: Applies to src/**/*.py : Handle `APIConnectionError` from Llama Stack in API operations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/shields.py` around lines 137 - 149, The code eagerly calls client.models.list() right after get_shields_for_request(), causing unnecessary Llama Stack calls on fast paths; change run_shield_moderation so you only call client.models.list() when shields_to_run is non-empty and at least one shield has provider_id == "llama-guard" (check shields_to_run and any(s.provider_id == "llama-guard" for s in shields_to_run)) before invoking client.models.list(); use the existing shields_to_run variable and the shield.provider_id/provider_resource_id checks to gate the lookup and avoid introducing latency or upstream 5xxs on requests that don't need a llama-guard model lookup.
🧹 Nitpick comments (1)
tests/unit/app/endpoints/test_streaming_query.py (1)
845-880: This test stops one layer short of the blocked-stream regression.

The new path here only asserts that `retrieve_response_generator()` picks the refusal generator. The bug is in the wrapper above it: `generate_response()` still treats that generator as a normal completion. Please run the returned generator through `generate_response()` (or the full endpoint) and assert `get_topic_summary()`/`consume_query_tokens()` stay untouched for blocked moderation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/app/endpoints/test_streaming_query.py` around lines 845 - 880, The test currently verifies retrieve_response_generator returns the refusal generator but doesn't exercise the wrapper that caused the regression; update the test to take the returned generator from retrieve_response_generator (call it _generator) and pass it into generate_response (or call the full streaming endpoint) so the refusal generator is executed through generate_response, then assert that get_topic_summary() and consume_query_tokens() were not called/modified for the blocked moderation case and that the TurnSummary produced by generate_response (or the endpoint) retains llm_response == "Content blocked"; use the existing mocked objects (mock_context, mock_client, mock_moderation_result) and patch spies on get_topic_summary and consume_query_tokens to validate they remain untouched.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/models/requests.py`:
- Around line 179-183: The Field description for shield_ids in the request
schema only documents the None case but omits that an explicit empty list
opt-out means "run no shields"; update the description string for shield_ids
(Field on the shield_ids attribute) to state both behaviors: None = use all
configured shields, [] = explicitly run no shields, and provide example for the
empty list; ensure the public contract matches get_shields_for_request()
semantics so OpenAPI docs reflect the empty-list opt-out.
---
Duplicate comments:
In `@src/app/endpoints/streaming_query.py`:
- Around line 305-320: The refusal path currently returns only
(shield_violation_generator(...), turn_summary) which causes generate_response()
to treat it as a normal success and run the success-only footer and bookkeeping;
change this return to include an explicit blocked flag or different wrapper so
generate_response can detect it and skip the success footer. Concretely, update
the block in streaming_query.py that uses context.moderation_result to return a
3-tuple (response_iterator, turn_summary, blocked=True) or a small
ResponseWrapper object, and then modify generate_response (the consumer of that
return) to check the blocked flag or wrapper type and branch to skip
topic-summary generation and quota/token bookkeeping when blocked is true; keep
existing calls like append_turn_items_to_conversation and
shield_violation_generator unchanged.
In `@src/utils/shields.py`:
- Around line 137-149: The code eagerly calls client.models.list() right after
get_shields_for_request(), causing unnecessary Llama Stack calls on fast paths;
change run_shield_moderation so you only call client.models.list() when
shields_to_run is non-empty and at least one shield has provider_id ==
"llama-guard" (check shields_to_run and any(s.provider_id == "llama-guard" for s
in shields_to_run)) before invoking client.models.list(); use the existing
shields_to_run variable and the shield.provider_id/provider_resource_id checks
to gate the lookup and avoid introducing latency or upstream 5xxs on requests
that don't need a llama-guard model lookup.
---
Nitpick comments:
In `@tests/unit/app/endpoints/test_streaming_query.py`:
- Around line 845-880: The test currently verifies retrieve_response_generator
returns the refusal generator but doesn't exercise the wrapper that caused the
regression; update the test to take the returned generator from
retrieve_response_generator (call it _generator) and pass it into
generate_response (or call the full streaming endpoint) so the refusal generator
is executed through generate_response, then assert that get_topic_summary() and
consume_query_tokens() were not called/modified for the blocked moderation case
and that the TurnSummary produced by generate_response (or the endpoint) retains
llm_response == "Content blocked"; use the existing mocked objects
(mock_context, mock_client, mock_moderation_result) and patch spies on
get_topic_summary and consume_query_tokens to validate they remain untouched.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2fd22921-2622-47ed-b58d-b3290b000e77
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (16)
- pyproject.toml
- src/app/endpoints/query.py
- src/app/endpoints/streaming_query.py
- src/constants.py
- src/models/context.py
- src/models/requests.py
- src/utils/conversations.py
- src/utils/responses.py
- src/utils/shields.py
- src/utils/types.py
- src/utils/vector_search.py
- tests/e2e/features/info.feature
- tests/unit/app/endpoints/test_query.py
- tests/unit/app/endpoints/test_streaming_query.py
- tests/unit/utils/test_shields.py
- tests/unit/utils/test_vector_search.py
💤 Files with no reviewable changes (1)
- src/utils/types.py
🚧 Files skipped from review as they are similar to previous changes (5)
- src/utils/vector_search.py
- src/models/context.py
- src/utils/responses.py
- src/constants.py
- tests/e2e/features/info.feature
047e955 to b9ae26e (compare)
    tool_results: list[ToolResultSummary] = Field(default_factory=list)
    rag_chunks: list[RAGChunk] = Field(default_factory=list)
    referenced_documents: list[ReferencedDocument] = Field(default_factory=list)
    inline_rag_documents: list[ReferencedDocument] = Field(default_factory=list)

Merged with tool-based docs into referenced_documents
are-ces left a comment:
TY very nice, LGTM
Just a tiny nit.
    available_models = {model.id for model in await client.models.list()}

    for shield in shields_to_run:
        # Only validate provider_resource_id against models for llama-guard.

Do we want to leave some small comment here to remember why we have this check?

Returned a shorter version of the original comment
b9ae26e to 726d986 (compare)
726d986 to 8d2deee (compare)
d7ef69a to b15482f (compare)
b15482f to 524ac1c (compare)
Description
This PR refactors the flow of shield moderation. Moderation is now executed before inline RAG, so expensive RAG operations are skipped when the input is blocked by shields. It also refactors the main `run_moderation` function.

Finally, it fixes an inconsistency between the query and streaming-query persistence of inline RAG context, and adds unit tests verifying that RAG contents of both types are correctly merged.
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Improvements
Chores