
Conversation

@Danelegend
Contributor

@Danelegend Danelegend commented Jan 9, 2026

Description

There is a bug where, if the FastAPI client stops consuming the stream, the flow halts on a 'yield from' and never progresses past it. When this path occurs, the code after the 'yield from' does not run, including the logic that saves the chat to the database.

This PR moves that logic into a callback that runs at the conclusion of the LLM thread. The fix does implicitly rely on the guarantee (afaik) that a 'finally' block always runs, but it also makes semantic sense in this instance.
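The failure mode is reproducible in isolation: when a consumer abandons a generator mid-stream, code placed after the `yield from` never executes, while code in a `finally` block still runs when the generator is closed. A minimal sketch (all names here are illustrative, not the actual Onyx code):

```python
saved = []

def save_chat_to_db():
    # Stand-in for the real "save chat turn" logic.
    saved.append("chat")

def llm_stream():
    yield from iter(["token-1", "token-2", "token-3"])
    # Unreachable if the consumer stops iterating mid-stream:
    save_chat_to_db()

def stream_with_callback(completion_callback):
    try:
        yield from iter(["token-1", "token-2", "token-3"])
    finally:
        # Runs on normal exhaustion, on generator .close()
        # (also triggered by garbage collection), and on exceptions.
        completion_callback()

# Consumer disconnects after the first token:
gen = llm_stream()
next(gen)
gen.close()          # post-yield code never ran
assert saved == []

gen = stream_with_callback(save_chat_to_db)
next(gen)
gen.close()          # finally fires, chat is saved
assert saved == ["chat"]
```

Closing an abandoned generator raises `GeneratorExit` at the suspended `yield`, which skips everything after it except `finally` blocks, which is exactly the guarantee this PR leans on.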

How Has This Been Tested?

Manually tested, including sending messages and refreshing mid-stream. Also tested the stop-message scenario.

Additional Options

  • [Optional] Override Linear Check

Summary by cubic

Fixes a bug where the chat flow could stall if the FastAPI stream stopped, causing the post-run logic (including DB save) to be skipped. Moves post-LLM work into a completion callback so the chat turn is always finalized and saved.

  • Bug Fixes
    • Added completion_callback to run_chat_loop_with_state_containers; it runs after the background thread completes and emits PacketException on failure.
    • Added llm_loop_completion_handle to build the final answer, collect citations, and save the assistant message.
    • Handles user-stopped runs by appending a stop notice and still saving the turn.

Written for commit 05b56df. Summary will update on new commits.
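The summary bullets above can be sketched as a runnable toy. The function names follow the PR, but the container fields, signatures, and bodies are assumptions for illustration, not the actual Onyx implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, Optional

@dataclass
class ChatStateContainer:
    # Hypothetical stand-in for the PR's state container.
    answer_tokens: list = field(default_factory=list)
    user_stopped: bool = False
    saved_message: Optional[str] = None

def llm_loop_completion_handle(state: ChatStateContainer) -> None:
    """Build the final answer and persist it (sketch of the PR's handle)."""
    answer = "".join(state.answer_tokens)
    if state.user_stopped:
        answer += " [stopped by user]"  # stop notice is still persisted
    state.saved_message = answer       # stand-in for the DB save

def run_chat_loop_with_state_containers(
    tokens: list,
    state: ChatStateContainer,
    completion_callback: Optional[Callable[[ChatStateContainer], None]] = None,
) -> Iterator[str]:
    try:
        for tok in tokens:
            state.answer_tokens.append(tok)
            yield tok
    finally:
        # Fires whether the stream is exhausted, closed by a
        # disconnecting client, or interrupted by an exception.
        if completion_callback is not None:
            completion_callback(state)

state = ChatStateContainer()
gen = run_chat_loop_with_state_containers(
    ["Hello"], state, completion_callback=llm_loop_completion_handle
)
next(gen)            # client receives "Hello"...
state.user_stopped = True
gen.close()          # ...then disconnects mid-stream
assert state.saved_message == "Hello [stopped by user]"
```

Even though the client disconnected before the stream finished, the turn is finalized and "saved" by the callback.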

@Danelegend Danelegend requested a review from a team as a code owner January 9, 2026 06:32
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 2 files

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR attempts to fix a bug where chat state isn't saved to the database when the FastAPI stream stops prematurely. The fix moves the post-LLM logic (building final answer, collecting citations, saving to DB) into a completion_callback that runs in a finally block, ensuring it executes even if the stream halts.

Changes:

  • Added completion_callback parameter to run_chat_loop_with_state_containers that executes in the finally block after the LLM thread completes
  • Extracted post-LLM logic into new function llm_loop_completion_handle
  • Created callback llm_loop_completion_callback that invokes the handle function with necessary context

Issues:
The implementation has a critical flaw with exception handling. When an exception occurs in the LLM loop:

  1. The exception is caught and emitted as PacketException
  2. It's re-raised in the main thread (line 179 of chat_state.py)
  3. The finally block executes the callback, which calls save_chat_turn and commits to DB
  4. The exception continues propagating to handle_stream_message_objects
  5. Exception handlers there call db_session.rollback(), undoing the save

This means exceptions will cause the chat to be saved then immediately rolled back, defeating the purpose of the fix. The callback needs to only run on successful completion, or the exception handlers need to be aware that the callback has already run and avoid rolling back.
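The ordering Greptile describes, the `finally` block firing before the exception reaches the outer handler, is verifiable in isolation. (Whether a later `rollback()` actually undoes an already-committed transaction depends on the session semantics; this sketch only demonstrates the control flow the review is built on.)

```python
events = []

def inner():
    # Models run_chat_loop_with_state_containers: the callback sits
    # in a finally block around the LLM loop.
    try:
        raise RuntimeError("LLM loop failed")
    finally:
        events.append("completion_callback (commit)")

def outer():
    # Models handle_stream_message_objects' exception handler.
    try:
        inner()
    except RuntimeError:
        events.append("outer handler (rollback)")

outer()
# The finally block runs first, then the exception propagates outward.
assert events == ["completion_callback (commit)", "outer handler (rollback)"]
```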

Confidence Score: 1/5

  • Not safe to merge - contains critical database consistency bug with exception handling
  • The PR introduces a critical bug where exceptions cause the chat to be saved and then immediately rolled back. When an exception occurs in the LLM loop, the completion callback runs in the finally block (committing chat state to DB), but then the exception propagates to outer exception handlers that call db_session.rollback(). This creates database inconsistency and defeats the purpose of the fix. Additionally, the callback itself has no exception handling, which could mask original exceptions.
  • backend/onyx/chat/chat_state.py (callback execution timing), backend/onyx/chat/process_message.py (exception handling in callback)

Important Files Changed

File Analysis

Filename Score Overview
backend/onyx/chat/chat_state.py 2/5 Added completion_callback parameter that runs in finally block after LLM loop completes, enabling post-processing regardless of success/failure
backend/onyx/chat/process_message.py 2/5 Moved post-LLM logic into llm_loop_completion_handle callback to ensure chat is saved even if stream stops, but has exception handling issue with db rollback

Sequence Diagram

sequenceDiagram
    participant Client
    participant handle_stream_message_objects
    participant run_chat_loop_with_state_containers
    participant Background Thread
    participant Callback
    participant DB

    Client->>handle_stream_message_objects: Send message
    handle_stream_message_objects->>run_chat_loop_with_state_containers: Start LLM loop
    run_chat_loop_with_state_containers->>Background Thread: Launch LLM execution
    
    alt Normal completion
        Background Thread->>run_chat_loop_with_state_containers: Emit packets
        run_chat_loop_with_state_containers->>Client: Yield packets
        Background Thread->>run_chat_loop_with_state_containers: Emit OverallStop
        Note over run_chat_loop_with_state_containers: Finally block executes
        run_chat_loop_with_state_containers->>Callback: completion_callback(state_container)
        Callback->>DB: save_chat_turn() + commit()
        run_chat_loop_with_state_containers-->>handle_stream_message_objects: Return
    else Exception in LLM
        Background Thread->>Background Thread: Exception occurs
        Background Thread->>run_chat_loop_with_state_containers: Emit PacketException
        run_chat_loop_with_state_containers->>run_chat_loop_with_state_containers: Re-raise exception (line 179)
        Note over run_chat_loop_with_state_containers: Finally block executes FIRST
        run_chat_loop_with_state_containers->>Callback: completion_callback(state_container)
        Callback->>DB: save_chat_turn() + commit()
        Note over DB: Chat saved to DB
        run_chat_loop_with_state_containers-->>handle_stream_message_objects: Exception propagates
        handle_stream_message_objects->>DB: rollback()
        Note over DB: CRITICAL: Chat save rolled back!
        handle_stream_message_objects->>Client: Yield StreamingError
    else User cancellation
        Client->>run_chat_loop_with_state_containers: Disconnect
        run_chat_loop_with_state_containers->>run_chat_loop_with_state_containers: Detect !is_connected()
        Note over run_chat_loop_with_state_containers: Finally block executes
        run_chat_loop_with_state_containers->>Callback: completion_callback(state_container)
        Callback->>DB: save_chat_turn() with stop message + commit()
        run_chat_loop_with_state_containers-->>handle_stream_message_objects: Return
    end

Comment on lines +678 to +682
if completed_normally:
    if state_container.answer_tokens is None:
        raise RuntimeError(
            "LLM run completed normally but did not return an answer."
        )
Contributor


[P2] When the LLM completes normally but answer_tokens is None, this raises RuntimeError. However, since this runs in the completion callback (which executes in the finally block), this exception will propagate differently than the original design where it was caught by the exception handlers in handle_stream_message_objects. The error handling path has changed.

(Path: backend/onyx/chat/process_message.py, lines 678–682)

Contributor Author


This is handled in the layer that runs this function.

@greptile-apps
Contributor

greptile-apps bot commented Jan 9, 2026

Additional Comments (1)

backend/onyx/chat/chat_state.py
[P0] The callback runs before exceptions propagate, causing database inconsistency. When a PacketException is raised at line 179, the finally block executes the callback (which commits to DB), then the exception continues to handle_stream_message_objects where db_session.rollback() is called (lines 615, 662 in process_message.py). This rolls back the chat save that just happened. The callback should only run on successful completion, not when exceptions occur.

(Path: backend/onyx/chat/chat_state.py, lines 195–200)

Contributor

@evan-onyx evan-onyx left a comment


could you add an integration test for this before merging?

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/tests/integration/common_utils/managers/chat.py">

<violation number="1" location="backend/tests/integration/common_utils/managers/chat.py:196">
P2: Docstring references non-existent parameter `disconnect_after_type`. This parameter is documented but not present in the function signature. Either add the parameter or remove the documentation.</violation>

<violation number="2" location="backend/tests/integration/common_utils/managers/chat.py:202">
P2: Docstring incorrectly states the function returns a `StreamedResponse`, but the function is annotated as `-> None` and returns `None`. Update the docstring to reflect the actual return type.</violation>
</file>


message: The message to send
disconnect_after_packets: Disconnect after receiving this many packets.
If None, disconnect_after_type must be specified.
disconnect_after_type: Disconnect after receiving a packet of this type
Contributor

@cubic-dev-ai cubic-dev-ai bot Jan 9, 2026


P2: Docstring references non-existent parameter disconnect_after_type. This parameter is documented but not present in the function signature. Either add the parameter or remove the documentation.

<file context>
@@ -164,6 +164,87 @@ def send_message(
+            message: The message to send
+            disconnect_after_packets: Disconnect after receiving this many packets.
+                If None, disconnect_after_type must be specified.
+            disconnect_after_type: Disconnect after receiving a packet of this type
+                (e.g., "message_start", "search_tool_start"). If None,
+                disconnect_after_packets must be specified.
</file context>

Comment on lines +202 to +203
StreamedResponse containing data received before disconnect,
with is_disconnected=True flag set.
Contributor

@cubic-dev-ai cubic-dev-ai bot Jan 9, 2026


P2: Docstring incorrectly states the function returns a StreamedResponse, but the function is annotated as -> None and returns None. Update the docstring to reflect the actual return type.

<file context>
@@ -164,6 +164,87 @@ def send_message(
+            ... (other standard message parameters)
+
+        Returns:
+            StreamedResponse containing data received before disconnect,
+            with is_disconnected=True flag set.
+        """
</file context>
Suggested change
-            StreamedResponse containing data received before disconnect,
-            with is_disconnected=True flag set.
+            None. This method simulates a disconnect and does not return response data.

@Danelegend Danelegend added this pull request to the merge queue Jan 9, 2026
Merged via the queue into main with commit f4d4d14 Jan 9, 2026
74 checks passed
@Danelegend Danelegend deleted the db_save_in_thrd branch January 9, 2026 23:58