feat: support multi-model chat #7302
base: main
Conversation
1 issue found across 4 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/onyx/chat/process_message.py">
<violation number="1" location="backend/onyx/chat/process_message.py:348">
P1: The same `db_session` is shared across 3 parallel threads, but SQLAlchemy sessions are not thread-safe. This contradicts the existing comment that justifies single-threaded db_session usage. Consider creating separate sessions per thread or using a scoped session.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
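A minimal sketch of the per-thread session approach suggested in the finding above, assuming a standard SQLAlchemy setup where the engine can be recovered from the existing session; `run_model_worker` is an illustrative stand-in, not the PR's actual worker function.

```python
# Sketch only: give each worker thread its own SQLAlchemy session rather than
# sharing `db_session` across threads. `run_model_worker` is an illustrative
# stand-in for the per-model loop in this PR.
from sqlalchemy.orm import Session, sessionmaker


def run_model_worker(model_index: int, shared_db_session: Session) -> None:
    # Build a fresh session bound to the same engine as the caller's session.
    session_factory = sessionmaker(bind=shared_db_session.get_bind())
    with session_factory() as thread_session:
        # ... run the model loop, handing `thread_session` to any DB helpers ...
        pass
```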
Greptile Overview
Greptile Summary
This PR implements multi-model chat functionality, allowing the backend to run 3 LLM models in parallel and stream their responses with unique identifiers. The implementation adds new data models and threading logic to support concurrent model execution.
Major changes:
- Adds `MultiModelMessageResponseIDInfo` model to track multiple assistant message IDs
- Adds `llm_overrides` field to `SendMessageRequest` for specifying 3 models
- Adds `model_index` to `Placement` for routing packets to the correct model in the UI
- Implements `_run_multi_model_chat_loops()` to orchestrate parallel model execution using threads and a shared queue
Critical issues found:
- Missing validation that `llm_overrides` contains exactly 3 elements (silently falls back to single-model)
- Dead code: unused list comprehension creating `threading.Event` objects (line 310)
- Incorrect telemetry: `MULTIPLE_ASSISTANTS` milestone tracked for ALL chats, not just multi-model
- Missing LLM cost limit checks for the 3 models (only checks the default LLM)
- No validation that any model succeeded before attempting to save all 3 responses
- Thread cleanup timeout may be too short (5s) for LLM operations
Style issues:
- Missing explicit type annotations in the `ModelIndexEmitter` class (violates custom guidelines)
- Threads not created as daemons (could prevent clean process exit)
- Model name fallback logic could produce confusing names like "Model Model 1"
Confidence Score: 1/5
- This PR has multiple critical issues that could cause runtime failures and incorrect behavior in production
- Score reflects several critical logic errors including missing validation (silent fallback behavior), dead code, incorrect telemetry tracking for all users, missing cost limit checks that could violate usage policies, and inadequate error handling that could save corrupt data when all models fail. The threading implementation also has potential resource leaks with insufficient cleanup timeouts
- Primary attention needed on `backend/onyx/chat/process_message.py`, which contains the bulk of the issues, including dead code, missing validation, and error handling problems. Also review `backend/onyx/server/query_and_chat/models.py` to add proper validation.
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| backend/onyx/chat/models.py | 4/5 | Adds MultiModelMessageResponseIDInfo model for multi-model chat responses - clean addition with no issues |
| backend/onyx/server/query_and_chat/models.py | 3/5 | Adds llm_overrides field but lacks validation to enforce mutual exclusivity with llm_override and exactly 3 elements requirement |
| backend/onyx/server/query_and_chat/placement.py | 5/5 | Adds model_index field to Placement - straightforward addition with clear documentation |
| backend/onyx/chat/process_message.py | 1/5 | Major implementation with multiple critical issues: missing validation, dead code, incorrect telemetry tracking, missing LLM cost checks, inadequate error handling for multi-model failures, and threading concerns |
Sequence Diagram
sequenceDiagram
participant Client
participant handle_stream_message_objects
participant Multi-Model Logic
participant Thread 1 (Model 1)
participant Thread 2 (Model 2)
participant Thread 3 (Model 3)
participant Shared Queue
participant DB
Client->>handle_stream_message_objects: SendMessageRequest with llm_overrides[3]
handle_stream_message_objects->>Multi-Model Logic: Check len(llm_overrides) == 3
Multi-Model Logic->>Multi-Model Logic: Create 3 LLM instances
Multi-Model Logic->>DB: Reserve 3 assistant message IDs
DB-->>Multi-Model Logic: Return message IDs
Multi-Model Logic->>Client: Yield MultiModelMessageResponseIDInfo
Multi-Model Logic->>Thread 1 (Model 1): Start run_model_loop(model_index=0)
Multi-Model Logic->>Thread 2 (Model 2): Start run_model_loop(model_index=1)
Multi-Model Logic->>Thread 3 (Model 3): Start run_model_loop(model_index=2)
par Model 1 Generation
Thread 1 (Model 1)->>Thread 1 (Model 1): run_llm_loop()
Thread 1 (Model 1)->>Shared Queue: Put (0, Packet with model_index=0)
and Model 2 Generation
Thread 2 (Model 2)->>Thread 2 (Model 2): run_llm_loop()
Thread 2 (Model 2)->>Shared Queue: Put (1, Packet with model_index=1)
and Model 3 Generation
Thread 3 (Model 3)->>Thread 3 (Model 3): run_llm_loop()
Thread 3 (Model 3)->>Shared Queue: Put (2, Packet with model_index=2)
end
loop While not all completed
Shared Queue->>Multi-Model Logic: Get packet with timeout=0.3s
Multi-Model Logic->>Client: Yield packet (tagged with model_index)
end
Thread 1 (Model 1)->>Shared Queue: Put (0, None) - Signal completion
Thread 2 (Model 2)->>Shared Queue: Put (1, None) - Signal completion
Thread 3 (Model 3)->>Shared Queue: Put (2, None) - Signal completion
Multi-Model Logic->>Multi-Model Logic: All threads completed
loop For each model
Multi-Model Logic->>DB: save_chat_turn(assistant_message, final_answer, citations)
end
Multi-Model Logic-->>Client: Complete streaming
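For readers skimming the diagram, the fan-in pattern it describes (each worker thread tags packets with its model index, pushes them onto one shared queue, and signals completion with a sentinel while a single consumer drains the queue) can be sketched as below; the names are illustrative rather than the PR's actual helpers.

```python
# Illustrative sketch of the fan-in pattern in the diagram above: each worker
# tags its output with a model index and signals completion with None.
from collections.abc import Callable, Iterable
import queue
import threading


def fan_in(num_models: int, produce: Callable[[int], Iterable[str]]) -> list[list[str]]:
    shared_queue: queue.Queue[tuple[int, str | None]] = queue.Queue()

    def worker(index: int) -> None:
        try:
            for packet in produce(index):
                shared_queue.put((index, packet))
        finally:
            shared_queue.put((index, None))  # completion sentinel

    threads = [
        threading.Thread(target=worker, args=(i,), daemon=True)
        for i in range(num_models)
    ]
    for t in threads:
        t.start()

    # Single consumer: drain packets until every producer has signaled completion.
    results: list[list[str]] = [[] for _ in range(num_models)]
    remaining = num_models
    while remaining:
        index, packet = shared_queue.get()
        if packet is None:
            remaining -= 1
        else:
            results[index].append(packet)

    for t in threads:
        t.join()
    return results
```

For example, `fan_in(3, lambda i: (f"model-{i}-token-{n}" for n in range(3)))` returns three lists of tokens, one per model index.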
    # Shared queue for all model outputs
    shared_queue: queue.Queue[tuple[int, Packet | None]] = queue.Queue()
    [threading.Event() for _ in llms]
Dead code - this list comprehension creates threading.Event objects that are never assigned or used
Suggested change (remove the dead line):

    # Shared queue for all model outputs
    shared_queue: queue.Queue[tuple[int, Packet | None]] = queue.Queue()
backend/onyx/chat/process_message.py (outdated)
    if new_msg_req.llm_overrides and len(new_msg_req.llm_overrides) == 3:
        # Multi-model chat: run 3 models in parallel
        if new_msg_req.deep_research:
            raise RuntimeError(
                "Deep research is not supported for multi-model chat"
            )
Missing validation - the comment says "exactly 3 required" but there's no enforcement. If user passes 2, 4, or any other number, it silently falls through to single-model mode. Add explicit validation:
Suggested change:

    # Check for multi-model chat mode
    if new_msg_req.llm_overrides and len(new_msg_req.llm_overrides) > 0:
        if len(new_msg_req.llm_overrides) != 3:
            raise ValueError(
                f"Multi-model chat requires exactly 3 LLM overrides, got {len(new_msg_req.llm_overrides)}"
            )
        # Multi-model chat: run 3 models in parallel
        if new_msg_req.deep_research:
            raise RuntimeError(
                "Deep research is not supported for multi-model chat"
            )
    model_names.append(
        llm_override.model_version
        or llm_override.model_provider
        or f"Model {len(llms)}"
    )
Model name fallback creates a confusing display. When both `model_version` and `model_provider` are None, this generates "Model 1", "Model 2", "Model 3". But line 761 uses `f"Model {model_names[i]}"`, which would produce "Model Model 1". Consider: `llm_override.model_version or llm_override.model_provider or f"Model {i + 1}"`
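A short sketch of the fix suggested above, using the loop index via `enumerate` so the numeric fallback no longer depends on `len(llms)`; only the relevant lines of the loop are shown.

```python
# Sketch of the suggested fallback: derive the generic name from the loop
# index so it stays stable regardless of when llms is appended to.
for i, llm_override in enumerate(new_msg_req.llm_overrides):
    ...
    model_names.append(
        llm_override.model_version
        or llm_override.model_provider
        or f"Model {i + 1}"
    )
```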
    # Create LLM instances for each model override
    llms: list[LLM] = []
    model_names: list[str] = []
    for llm_override in new_msg_req.llm_overrides:
        model_llm = get_llm_for_persona(
            persona=persona,
            user=user,
            llm_override=llm_override,
            additional_headers=litellm_additional_headers,
            long_term_logger=long_term_logger,
        )
        llms.append(model_llm)
        model_names.append(
            llm_override.model_version
            or llm_override.model_provider
            or f"Model {len(llms)}"
        )
Missing LLM cost limit checks for multi-model. The single-model path checks cost limits at line 512-518, but multi-model creates 3 LLM instances without checking if each model's API key is within cost limits. Should validate cost limits for each llm_override before creating the LLM instances
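One way to address this, sketched below with a hypothetical `enforce_llm_cost_limits` helper standing in for whatever check the single-model path performs, is to validate every override before any of the three LLM instances is created.

```python
# Sketch only: `enforce_llm_cost_limits` is a hypothetical stand-in for the
# existing single-model cost-limit check; the point is to run it once per
# override before building the three LLM instances.
for llm_override in new_msg_req.llm_overrides:
    enforce_llm_cost_limits(
        user=user,
        persona=persona,
        llm_override=llm_override,
    )
```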
    completed_normally = check_is_connected()
    for i, (assistant_response, state_container) in enumerate(
        zip(assistant_responses, state_containers)
    ):
        if completed_normally:
            if state_container.answer_tokens is None:
                final_answer = (
                    f"Model {model_names[i]} did not return an answer."
                )
            else:
                final_answer = state_container.answer_tokens
        else:
            if state_container.answer_tokens:
                final_answer = (
                    state_container.answer_tokens
                    + " ... The generation was stopped by the user here."
                )
            else:
                final_answer = "The generation was stopped by the user."

        # Build citation_docs_info from accumulated citations
        citation_docs_info: list[CitationDocInfo] = []
        seen_citation_nums: set[int] = set()
        for citation_num, search_doc in state_container.citation_to_doc.items():
            if citation_num not in seen_citation_nums:
                seen_citation_nums.add(citation_num)
                citation_docs_info.append(
                    CitationDocInfo(
                        search_doc=search_doc,
                        citation_number=citation_num,
                    )
                )

        save_chat_turn(
            message_text=final_answer,
            reasoning_tokens=state_container.reasoning_tokens,
            citation_docs_info=citation_docs_info,
            tool_calls=state_container.tool_calls,
            db_session=db_session,
            assistant_message=assistant_response,
            is_clarification=state_container.is_clarification,
        )
No validation that any model succeeded before saving. If all 3 models fail (exceptions caught in _run_multi_model_chat_loops), the code still tries to save all responses. Check if state_container.answer_tokens is None for all models and handle appropriately (either raise error or save partial results with clear error messages)
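A minimal guard along the lines of this comment, assuming `state_containers` is in scope just before the save loop; the error message wording is illustrative.

```python
# Sketch: if every model failed to produce an answer, surface an error
# instead of persisting three empty responses.
if all(sc.answer_tokens is None for sc in state_containers):
    raise RuntimeError(
        "All models failed to generate a response for this multi-model turn"
    )
```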
    # Start all model loops in parallel
    threads = []
    for i, (llm, state_container) in enumerate(zip(llms, state_containers)):
        thread = threading.Thread(target=run_model_loop, args=(i, llm, state_container))
Thread daemon mode not set - threads should be created with daemon=True to ensure they don't prevent the process from exiting if the main thread terminates unexpectedly
Suggested change:

    thread = threading.Thread(target=run_model_loop, args=(i, llm, state_container), daemon=True)
    finally:
        # Wait for all threads to complete
        for thread in threads:
            thread.join(timeout=5.0)
Thread cleanup timeout is too short - a 5-second timeout may not be enough for the LLM loops to clean up properly, especially if they're mid-generation. Consider increasing it to 10-30 seconds or making it configurable; threads that don't join will leak resources.
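A sketch of the configurable variant, assuming `threads` is the list built above; the environment variable name and the 30-second default are illustrative.

```python
# Sketch: read the join timeout from the environment instead of hard-coding
# 5 seconds. MULTI_MODEL_THREAD_JOIN_TIMEOUT is an illustrative variable name.
import logging
import os

logger = logging.getLogger(__name__)

THREAD_JOIN_TIMEOUT_SECONDS = float(
    os.environ.get("MULTI_MODEL_THREAD_JOIN_TIMEOUT", "30")
)

for thread in threads:
    thread.join(timeout=THREAD_JOIN_TIMEOUT_SECONDS)
    if thread.is_alive():
        logger.warning("Model thread did not finish within the join timeout")
```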
    # For multi-model chat: list of LLM overrides to compare (exactly 3 required)
    llm_overrides: list[LLMOverride] | None = None
Missing validation in model - add a validator to ensure that if llm_overrides is provided, it contains exactly 3 elements and llm_override is None (they should be mutually exclusive)
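A sketch of such a validator, assuming the project uses Pydantic v2 and that `LLMOverride` is already importable in this module; unrelated fields are omitted.

```python
# Sketch, assuming Pydantic v2: llm_overrides, when provided, must contain
# exactly 3 entries and must not be combined with llm_override.
# LLMOverride is assumed to be importable from the existing models module.
from pydantic import BaseModel, model_validator


class SendMessageRequest(BaseModel):
    llm_override: LLMOverride | None = None
    llm_overrides: list[LLMOverride] | None = None
    # ... other fields omitted ...

    @model_validator(mode="after")
    def _validate_llm_overrides(self) -> "SendMessageRequest":
        if self.llm_overrides is not None:
            if len(self.llm_overrides) != 3:
                raise ValueError("llm_overrides must contain exactly 3 entries")
            if self.llm_override is not None:
                raise ValueError(
                    "llm_override and llm_overrides are mutually exclusive"
                )
        return self
```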
Additional Comments (1)
Description
This branch adds multi-model chat functionality, allowing users to select up to 3 AI models and receive parallel responses from each. The responses are displayed in a side-by-side comparison view where users can select their preferred response.
How Has This Been Tested?
Manual testing locally
Additional Options
Summary by cubic
Adds multi-model chat end-to-end: run 2–3 models in parallel per message, stream packets tagged by model_index, persist per-model results, and render side-by-side responses in the UI.
New Features
Migration
Written for commit ec1f0a4. Summary will update on new commits.