
Conversation


@onyxtrialee3 onyxtrialee3 commented Jan 9, 2026

Description

This branch adds multi-model chat functionality, allowing users to select up to 3 AI models and receive parallel responses from each. The responses are displayed in a side-by-side comparison view where users can select their preferred response.

How Has This Been Tested?

Manual testing locally

Additional Options

  • [Optional] Override Linear Check

Summary by cubic

Adds multi-model chat end-to-end: run 2–3 models in parallel per message, stream packets tagged by model_index, persist per-model results, and render side-by-side responses in the UI.

  • New Features

    • Enable multi-model mode via SendMessageRequest.llm_overrides (2–3); deep_research is disabled in this mode.
    • Stream packets with Placement.model_index; per-model stop/error is handled. Emit MultiModelMessageResponseIDInfo with user_message_id, reserved_assistant_message_ids, and model_names.
    • Frontend: MultiModelSelector to pick models, controller sends llm_overrides, and MultiModelResponseView shows parallel answers. Single-model flow (llm_override) is unchanged.
  • Migration

    • To enable the UI, set cookie multi-model-enabled=true.
    • When using multi-model, send 2–3 LLMOverride entries in llm_overrides. Handle MultiModelMessageResponseIDInfo and render streams using Placement.model_index (a request sketch follows this list).
    • If llm_overrides is not provided, existing single-model behavior continues.
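
A hedged sketch of what a multi-model request could look like given these notes. Only llm_overrides and its LLMOverride entries come from this PR; the endpoint URL, the message field, and the model values are illustrative placeholders, not confirmed against the codebase.

import requests

# Placeholder payload: "llm_overrides" is the field added by this PR; the
# provider/version values are examples, not a supported list.
payload = {
    "message": "Summarize our Q3 launch plan",
    "llm_overrides": [
        {"model_provider": "openai", "model_version": "gpt-4o"},
        {"model_provider": "anthropic", "model_version": "claude-sonnet-4"},
        {"model_provider": "openai", "model_version": "gpt-4o-mini"},
    ],
}

# Placeholder URL; stream=True because the response arrives as packets tagged
# with Placement.model_index, alongside a MultiModelMessageResponseIDInfo.
resp = requests.post("http://localhost:8080/api/chat/send-message", json=payload, stream=True)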

Written for commit ec1f0a4. Summary will update on new commits.

@onyxtrialee3 onyxtrialee3 requested a review from a team as a code owner January 9, 2026 00:32

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 4 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/onyx/chat/process_message.py">

<violation number="1" location="backend/onyx/chat/process_message.py:348">
P1: The same `db_session` is shared across 3 parallel threads, but SQLAlchemy sessions are not thread-safe. This contradicts the existing comment that justifies single-threaded db_session usage. Consider creating separate sessions per thread or using a scoped session.</violation>
</file>
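
For illustration, a minimal sketch of the per-thread-session approach, assuming each worker can build its own session from a shared sessionmaker; the connection string and loop body are placeholders, not Onyx's actual code.

import threading

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql://localhost/onyx")  # placeholder connection string
SessionLocal = sessionmaker(bind=engine)

def run_model_loop(model_index: int) -> None:
    # Each thread opens and closes its own session rather than sharing the
    # request handler's db_session across threads.
    with SessionLocal() as session:
        ...  # run the per-model LLM loop and persist results via `session`

threads = [
    threading.Thread(target=run_model_loop, args=(i,), daemon=True) for i in range(3)
]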

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@onyxtrialee3 onyxtrialee3 marked this pull request as draft January 9, 2026 00:37

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR implements multi-model chat functionality, allowing the backend to run three LLMs in parallel and stream their responses with unique identifiers. The implementation adds new data models and threading logic to support concurrent model execution.

Major changes:

  • Adds MultiModelMessageResponseIDInfo model to track multiple assistant message IDs
  • Adds llm_overrides field to SendMessageRequest for specifying 3 models
  • Adds model_index to Placement for routing packets to correct model in UI
  • Implements _run_multi_model_chat_loops() to orchestrate parallel model execution using threads and a shared queue (see the sketch after this list)
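
For illustration, a self-contained sketch of that fan-out/fan-in pattern (not the PR's actual implementation): worker threads push (model_index, packet) tuples onto one queue and send a None sentinel when done, and the consumer drains the queue until every worker has finished.

import queue
import threading
from collections.abc import Iterator

def worker(model_index: int, out: "queue.Queue[tuple[int, str | None]]") -> None:
    for chunk in (f"model{model_index}-part1", f"model{model_index}-part2"):  # stand-in for streamed packets
        out.put((model_index, chunk))
    out.put((model_index, None))  # completion sentinel for this model

def fan_in(num_models: int) -> Iterator[tuple[int, str]]:
    out: "queue.Queue[tuple[int, str | None]]" = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(i, out), daemon=True)
        for i in range(num_models)
    ]
    for t in threads:
        t.start()
    done = 0
    while done < num_models:
        model_index, packet = out.get(timeout=5.0)
        if packet is None:
            done += 1
        else:
            yield model_index, packet  # caller routes UI output by model_index
    for t in threads:
        t.join()

for idx, pkt in fan_in(3):
    print(idx, pkt)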

Critical issues found:

  • Missing validation that llm_overrides contains exactly 3 elements (silently falls back to single-model)
  • Dead code: unused list comprehension creating threading.Event objects (line 310)
  • Incorrect telemetry: MULTIPLE_ASSISTANTS milestone tracked for ALL chats, not just multi-model
  • Missing LLM cost limit checks for the 3 models (only checks the default LLM)
  • No validation that any model succeeded before attempting to save all 3 responses
  • Thread cleanup timeout may be too short (5s) for LLM operations

Style issues:

  • Missing explicit type annotations in ModelIndexEmitter class (violates custom guidelines)
  • Threads not created as daemons (could prevent clean process exit)
  • Model name fallback logic could produce confusing names like "Model Model 1"

Confidence Score: 1/5

  • This PR has multiple critical issues that could cause runtime failures and incorrect behavior in production
  • The score reflects several critical logic errors: missing validation (silent fallback behavior), dead code, incorrect telemetry tracking for all users, missing cost-limit checks that could violate usage policies, and inadequate error handling that could save corrupt data when all models fail. The threading implementation also has potential resource leaks from the insufficient cleanup timeout.
  • Primary attention is needed on backend/onyx/chat/process_message.py, which contains the bulk of the issues, including dead code, missing validation, and error-handling problems. Also review backend/onyx/server/query_and_chat/models.py to add proper validation.

Important Files Changed

File Analysis

Filename Score Overview
backend/onyx/chat/models.py 4/5 Adds MultiModelMessageResponseIDInfo model for multi-model chat responses - clean addition with no issues
backend/onyx/server/query_and_chat/models.py 3/5 Adds llm_overrides field but lacks validation to enforce mutual exclusivity with llm_override and exactly 3 elements requirement
backend/onyx/server/query_and_chat/placement.py 5/5 Adds model_index field to Placement - straightforward addition with clear documentation
backend/onyx/chat/process_message.py 1/5 Major implementation with multiple critical issues: missing validation, dead code, incorrect telemetry tracking, missing LLM cost checks, inadequate error handling for multi-model failures, and threading concerns

Sequence Diagram

sequenceDiagram
    participant Client
    participant handle_stream_message_objects
    participant Multi-Model Logic
    participant Thread 1 (Model 1)
    participant Thread 2 (Model 2)
    participant Thread 3 (Model 3)
    participant Shared Queue
    participant DB

    Client->>handle_stream_message_objects: SendMessageRequest with llm_overrides[3]
    
    handle_stream_message_objects->>Multi-Model Logic: Check len(llm_overrides) == 3
    
    Multi-Model Logic->>Multi-Model Logic: Create 3 LLM instances
    Multi-Model Logic->>DB: Reserve 3 assistant message IDs
    DB-->>Multi-Model Logic: Return message IDs
    Multi-Model Logic->>Client: Yield MultiModelMessageResponseIDInfo
    
    Multi-Model Logic->>Thread 1 (Model 1): Start run_model_loop(model_index=0)
    Multi-Model Logic->>Thread 2 (Model 2): Start run_model_loop(model_index=1)
    Multi-Model Logic->>Thread 3 (Model 3): Start run_model_loop(model_index=2)
    
    par Model 1 Generation
        Thread 1 (Model 1)->>Thread 1 (Model 1): run_llm_loop()
        Thread 1 (Model 1)->>Shared Queue: Put (0, Packet with model_index=0)
    and Model 2 Generation
        Thread 2 (Model 2)->>Thread 2 (Model 2): run_llm_loop()
        Thread 2 (Model 2)->>Shared Queue: Put (1, Packet with model_index=1)
    and Model 3 Generation
        Thread 3 (Model 3)->>Thread 3 (Model 3): run_llm_loop()
        Thread 3 (Model 3)->>Shared Queue: Put (2, Packet with model_index=2)
    end
    
    loop While not all completed
        Shared Queue->>Multi-Model Logic: Get packet with timeout=0.3s
        Multi-Model Logic->>Client: Yield packet (tagged with model_index)
    end
    
    Thread 1 (Model 1)->>Shared Queue: Put (0, None) - Signal completion
    Thread 2 (Model 2)->>Shared Queue: Put (1, None) - Signal completion
    Thread 3 (Model 3)->>Shared Queue: Put (2, None) - Signal completion
    
    Multi-Model Logic->>Multi-Model Logic: All threads completed
    
    loop For each model
        Multi-Model Logic->>DB: save_chat_turn(assistant_message, final_answer, citations)
    end
    
    Multi-Model Logic-->>Client: Complete streaming


# Shared queue for all model outputs
shared_queue: queue.Queue[tuple[int, Packet | None]] = queue.Queue()
[threading.Event() for _ in llms]

Dead code - this list comprehension creates threading.Event objects that are never assigned or used

Suggested change
[threading.Event() for _ in llms]
# Shared queue for all model outputs
shared_queue: queue.Queue[tuple[int, Packet | None]] = queue.Queue()

Comment on lines 688 to 693
if new_msg_req.llm_overrides and len(new_msg_req.llm_overrides) == 3:
    # Multi-model chat: run 3 models in parallel
    if new_msg_req.deep_research:
        raise RuntimeError(
            "Deep research is not supported for multi-model chat"
        )

Missing validation - the comment says "exactly 3 required" but there is no enforcement. If the user passes 2, 4, or any other number, it silently falls through to single-model mode. Add explicit validation:

Suggested change
if new_msg_req.llm_overrides and len(new_msg_req.llm_overrides) == 3:
    # Multi-model chat: run 3 models in parallel
    if new_msg_req.deep_research:
        raise RuntimeError(
            "Deep research is not supported for multi-model chat"
        )

# Check for multi-model chat mode
if new_msg_req.llm_overrides and len(new_msg_req.llm_overrides) > 0:
    if len(new_msg_req.llm_overrides) != 3:
        raise ValueError(
            f"Multi-model chat requires exactly 3 LLM overrides, got {len(new_msg_req.llm_overrides)}"
        )
    # Multi-model chat: run 3 models in parallel
    if new_msg_req.deep_research:
        raise RuntimeError(
            "Deep research is not supported for multi-model chat"
        )

Comment on lines +707 to +711
model_names.append(
    llm_override.model_version
    or llm_override.model_provider
    or f"Model {len(llms)}"
)

Model name fallback creates confusing display. When both model_version and model_provider are None, this generates "Model 1", "Model 2", "Model 3". But line 761 uses f"Model {model_names[i]}" which would produce "Model Model 1". Consider: llm_override.model_version or llm_override.model_provider or f"Model {i + 1}"
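
For illustration, the index-based fallback that comment suggests, in the shape of the loop quoted below (names mirror the diff; only the f-string changes):

for i, llm_override in enumerate(new_msg_req.llm_overrides):
    ...  # LLM construction elided; see the quoted loop below
    model_names.append(
        llm_override.model_version
        or llm_override.model_provider
        or f"Model {i + 1}"  # index-based fallback avoids "Model Model 1" downstream
    )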

Comment on lines +695 to +711
# Create LLM instances for each model override
llms: list[LLM] = []
model_names: list[str] = []
for llm_override in new_msg_req.llm_overrides:
    model_llm = get_llm_for_persona(
        persona=persona,
        user=user,
        llm_override=llm_override,
        additional_headers=litellm_additional_headers,
        long_term_logger=long_term_logger,
    )
    llms.append(model_llm)
    model_names.append(
        llm_override.model_version
        or llm_override.model_provider
        or f"Model {len(llms)}"
    )

Missing LLM cost limit checks for multi-model. The single-model path checks cost limits at lines 512-518, but multi-model creates 3 LLM instances without checking whether each model's API key is within cost limits. Cost limits should be validated for each llm_override before the LLM instances are created.
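
A hedged sketch only: run the same kind of cost-limit check the single-model path performs once per override, before constructing the LLMs. enforce_llm_cost_limit is a hypothetical stand-in for whatever helper the existing check at lines 512-518 uses, not an existing Onyx function.

for llm_override in new_msg_req.llm_overrides:
    enforce_llm_cost_limit(user=user, llm_override=llm_override)  # hypothetical helper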

Comment on lines +754 to 795
completed_normally = check_is_connected()
for i, (assistant_response, state_container) in enumerate(
    zip(assistant_responses, state_containers)
):
    if completed_normally:
        if state_container.answer_tokens is None:
            final_answer = (
                f"Model {model_names[i]} did not return an answer."
            )
        else:
            final_answer = state_container.answer_tokens
    else:
        if state_container.answer_tokens:
            final_answer = (
                state_container.answer_tokens
                + " ... The generation was stopped by the user here."
            )
        else:
            final_answer = "The generation was stopped by the user."

    # Build citation_docs_info from accumulated citations
    citation_docs_info: list[CitationDocInfo] = []
    seen_citation_nums: set[int] = set()
    for citation_num, search_doc in state_container.citation_to_doc.items():
        if citation_num not in seen_citation_nums:
            seen_citation_nums.add(citation_num)
            citation_docs_info.append(
                CitationDocInfo(
                    search_doc=search_doc,
                    citation_number=citation_num,
                )
            )

    save_chat_turn(
        message_text=final_answer,
        reasoning_tokens=state_container.reasoning_tokens,
        citation_docs_info=citation_docs_info,
        tool_calls=state_container.tool_calls,
        db_session=db_session,
        assistant_message=assistant_response,
        is_clarification=state_container.is_clarification,
    )

No validation that any model succeeded before saving. If all 3 models fail (exceptions caught in _run_multi_model_chat_loops), the code still tries to save all responses. Check whether state_container.answer_tokens is None for all models and handle it appropriately (either raise an error or save partial results with clear error messages).
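
A minimal sketch of such a guard, reusing the state containers quoted above and placed before the save loop:

# Bail out before persisting three empty responses when every model failed.
if all(sc.answer_tokens is None for sc in state_containers):
    raise RuntimeError("All models failed to produce an answer; nothing to persist")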

# Start all model loops in parallel
threads = []
for i, (llm, state_container) in enumerate(zip(llms, state_containers)):
    thread = threading.Thread(target=run_model_loop, args=(i, llm, state_container))

Thread daemon mode not set - threads should be created with daemon=True to ensure they don't prevent the process from exiting if the main thread terminates unexpectedly

Suggested change
thread = threading.Thread(target=run_model_loop, args=(i, llm, state_container))
thread = threading.Thread(target=run_model_loop, args=(i, llm, state_container), daemon=True)

Comment on lines +430 to +433
finally:
    # Wait for all threads to complete
    for thread in threads:
        thread.join(timeout=5.0)

Thread cleanup timeout is too short - a 5-second timeout may not be enough for LLM loops to clean up properly, especially if they're mid-generation. Consider increasing it to 10-30 seconds or making it configurable. Threads that don't join will leak resources.
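
A sketch of a longer, configurable join with a warning on threads that outlive it; the constant name is illustrative, not an existing Onyx setting.

import logging

logger = logging.getLogger(__name__)
MULTI_MODEL_THREAD_JOIN_TIMEOUT_SECONDS = 30.0  # illustrative default

for thread in threads:
    thread.join(timeout=MULTI_MODEL_THREAD_JOIN_TIMEOUT_SECONDS)
    if thread.is_alive():
        logger.warning("Multi-model thread did not exit within the join timeout")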

Comment on lines 86 to 87
# For multi-model chat: list of LLM overrides to compare (exactly 3 required)
llm_overrides: list[LLMOverride] | None = None

Missing validation in model - add a validator to ensure that if llm_overrides is provided, it contains exactly 3 elements and llm_override is None (they should be mutually exclusive)
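
A minimal sketch of that validator, assuming SendMessageRequest is a Pydantic v2 model and LLMOverride is already imported; field names follow the snippet above.

from pydantic import BaseModel, model_validator

class SendMessageRequest(BaseModel):
    llm_override: LLMOverride | None = None
    # For multi-model chat: list of LLM overrides to compare (exactly 3 required)
    llm_overrides: list[LLMOverride] | None = None

    @model_validator(mode="after")
    def _validate_llm_overrides(self) -> "SendMessageRequest":
        if self.llm_overrides is not None:
            if len(self.llm_overrides) != 3:
                raise ValueError("llm_overrides must contain exactly 3 entries")
            if self.llm_override is not None:
                raise ValueError("llm_override and llm_overrides are mutually exclusive")
        return self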


greptile-apps bot commented Jan 9, 2026

Additional Comments (1)

backend/onyx/chat/process_message.py
Incorrect telemetry - MULTIPLE_ASSISTANTS milestone is tracked for ALL chat sessions, not just multi-model ones. This should only fire when actually using multi-model chat (move inside the multi-model conditional block at line 688)
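
A sketch of the placement fix only; the telemetry call itself is elided since it isn't shown in this diff.

# Record the milestone only when the request actually runs multiple models.
if new_msg_req.llm_overrides and len(new_msg_req.llm_overrides) == 3:
    ...  # track the MULTIPLE_ASSISTANTS milestone here, not on every chat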

@onyxtrialee3 onyxtrialee3 changed the title feat: support multi-model chat from backend feat: support multi-model chat Jan 9, 2026
