feat: support multi-model chat #7302
base: main
Conversation
1 issue found across 4 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/onyx/chat/process_message.py">
<violation number="1" location="backend/onyx/chat/process_message.py:348">
P1: The same `db_session` is shared across 3 parallel threads, but SQLAlchemy sessions are not thread-safe. This contradicts the existing comment that justifies single-threaded db_session usage. Consider creating separate sessions per thread or using a scoped session.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
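A minimal sketch of the per-thread session approach suggested in the finding above, assuming a standard SQLAlchemy setup where the engine can be recovered from the existing session; `run_model_worker` is an illustrative stand-in, not the PR's actual worker function.

```python
# Sketch only: give each worker thread its own SQLAlchemy session rather than
# sharing `db_session` across threads. `run_model_worker` is an illustrative
# stand-in for the per-model loop in this PR.
from sqlalchemy.orm import Session, sessionmaker


def run_model_worker(model_index: int, shared_db_session: Session) -> None:
    # Build a fresh session bound to the same engine as the caller's session.
    session_factory = sessionmaker(bind=shared_db_session.get_bind())
    with session_factory() as thread_session:
        # ... run the model loop, handing `thread_session` to any DB helpers ...
        pass
```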
Greptile Overview
Greptile Summary
This PR implements multi-model chat functionality, allowing the backend to run 3 LLM models in parallel and stream their responses with unique identifiers. The implementation adds new data models and threading logic to support concurrent model execution.
Major changes:
- Adds `MultiModelMessageResponseIDInfo` model to track multiple assistant message IDs
- Adds `llm_overrides` field to `SendMessageRequest` for specifying 3 models
- Adds `model_index` to `Placement` for routing packets to the correct model in the UI
- Implements `_run_multi_model_chat_loops()` to orchestrate parallel model execution using threads and a shared queue
Critical issues found:
- Missing validation that `llm_overrides` contains exactly 3 elements (silently falls back to single-model)
- Dead code: unused list comprehension creating `threading.Event` objects (line 310)
- Incorrect telemetry: `MULTIPLE_ASSISTANTS` milestone tracked for ALL chats, not just multi-model
- Missing LLM cost limit checks for the 3 models (only checks the default LLM)
- No validation that any model succeeded before attempting to save all 3 responses
- Thread cleanup timeout may be too short (5s) for LLM operations
Style issues:
- Missing explicit type annotations in the `ModelIndexEmitter` class (violates custom guidelines)
- Threads not created as daemons (could prevent clean process exit)
- Model name fallback logic could produce confusing names like "Model Model 1"
Confidence Score: 1/5
- This PR has multiple critical issues that could cause runtime failures and incorrect behavior in production
- Score reflects several critical logic errors including missing validation (silent fallback behavior), dead code, incorrect telemetry tracking for all users, missing cost limit checks that could violate usage policies, and inadequate error handling that could save corrupt data when all models fail. The threading implementation also has potential resource leaks with insufficient cleanup timeouts
- Primary attention needed on `backend/onyx/chat/process_message.py`, which contains the bulk of the issues, including dead code, missing validation, and error handling problems. Also review `backend/onyx/server/query_and_chat/models.py` to add proper validation.
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| backend/onyx/chat/models.py | 4/5 | Adds MultiModelMessageResponseIDInfo model for multi-model chat responses - clean addition with no issues |
| backend/onyx/server/query_and_chat/models.py | 3/5 | Adds llm_overrides field but lacks validation to enforce mutual exclusivity with llm_override and exactly 3 elements requirement |
| backend/onyx/server/query_and_chat/placement.py | 5/5 | Adds model_index field to Placement - straightforward addition with clear documentation |
| backend/onyx/chat/process_message.py | 1/5 | Major implementation with multiple critical issues: missing validation, dead code, incorrect telemetry tracking, missing LLM cost checks, inadequate error handling for multi-model failures, and threading concerns |
Sequence Diagram
sequenceDiagram
participant Client
participant handle_stream_message_objects
participant Multi-Model Logic
participant Thread 1 (Model 1)
participant Thread 2 (Model 2)
participant Thread 3 (Model 3)
participant Shared Queue
participant DB
Client->>handle_stream_message_objects: SendMessageRequest with llm_overrides[3]
handle_stream_message_objects->>Multi-Model Logic: Check len(llm_overrides) == 3
Multi-Model Logic->>Multi-Model Logic: Create 3 LLM instances
Multi-Model Logic->>DB: Reserve 3 assistant message IDs
DB-->>Multi-Model Logic: Return message IDs
Multi-Model Logic->>Client: Yield MultiModelMessageResponseIDInfo
Multi-Model Logic->>Thread 1 (Model 1): Start run_model_loop(model_index=0)
Multi-Model Logic->>Thread 2 (Model 2): Start run_model_loop(model_index=1)
Multi-Model Logic->>Thread 3 (Model 3): Start run_model_loop(model_index=2)
par Model 1 Generation
Thread 1 (Model 1)->>Thread 1 (Model 1): run_llm_loop()
Thread 1 (Model 1)->>Shared Queue: Put (0, Packet with model_index=0)
and Model 2 Generation
Thread 2 (Model 2)->>Thread 2 (Model 2): run_llm_loop()
Thread 2 (Model 2)->>Shared Queue: Put (1, Packet with model_index=1)
and Model 3 Generation
Thread 3 (Model 3)->>Thread 3 (Model 3): run_llm_loop()
Thread 3 (Model 3)->>Shared Queue: Put (2, Packet with model_index=2)
end
loop While not all completed
Shared Queue->>Multi-Model Logic: Get packet with timeout=0.3s
Multi-Model Logic->>Client: Yield packet (tagged with model_index)
end
Thread 1 (Model 1)->>Shared Queue: Put (0, None) - Signal completion
Thread 2 (Model 2)->>Shared Queue: Put (1, None) - Signal completion
Thread 3 (Model 3)->>Shared Queue: Put (2, None) - Signal completion
Multi-Model Logic->>Multi-Model Logic: All threads completed
loop For each model
Multi-Model Logic->>DB: save_chat_turn(assistant_message, final_answer, citations)
end
Multi-Model Logic-->>Client: Complete streaming
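For readers skimming the diagram, the fan-in pattern it describes (each worker thread tags packets with its model index, pushes them onto one shared queue, and signals completion with a sentinel while a single consumer drains the queue) can be sketched as below; the names are illustrative rather than the PR's actual helpers.

```python
# Illustrative sketch of the fan-in pattern in the diagram above: each worker
# tags its output with a model index and signals completion with None.
from collections.abc import Callable, Iterable
import queue
import threading


def fan_in(num_models: int, produce: Callable[[int], Iterable[str]]) -> list[list[str]]:
    shared_queue: queue.Queue[tuple[int, str | None]] = queue.Queue()

    def worker(index: int) -> None:
        try:
            for packet in produce(index):
                shared_queue.put((index, packet))
        finally:
            shared_queue.put((index, None))  # completion sentinel

    threads = [
        threading.Thread(target=worker, args=(i,), daemon=True)
        for i in range(num_models)
    ]
    for t in threads:
        t.start()

    # Single consumer: drain packets until every producer has signaled completion.
    results: list[list[str]] = [[] for _ in range(num_models)]
    remaining = num_models
    while remaining:
        index, packet = shared_queue.get()
        if packet is None:
            remaining -= 1
        else:
            results[index].append(packet)

    for t in threads:
        t.join()
    return results
```

For example, `fan_in(3, lambda i: (f"model-{i}-token-{n}" for n in range(3)))` returns three lists of tokens, one per model index.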
    # Shared queue for all model outputs
    shared_queue: queue.Queue[tuple[int, Packet | None]] = queue.Queue()
    [threading.Event() for _ in llms]
Dead code - this list comprehension creates threading.Event objects that are never assigned or used
Suggested change (remove the dead line):

    # Shared queue for all model outputs
    shared_queue: queue.Queue[tuple[int, Packet | None]] = queue.Queue()
backend/onyx/chat/process_message.py (outdated)
    if new_msg_req.llm_overrides and len(new_msg_req.llm_overrides) == 3:
        # Multi-model chat: run 3 models in parallel
        if new_msg_req.deep_research:
            raise RuntimeError(
                "Deep research is not supported for multi-model chat"
            )
Missing validation - the comment says "exactly 3 required" but there's no enforcement. If user passes 2, 4, or any other number, it silently falls through to single-model mode. Add explicit validation:
Suggested change:

    # Check for multi-model chat mode
    if new_msg_req.llm_overrides and len(new_msg_req.llm_overrides) > 0:
        if len(new_msg_req.llm_overrides) != 3:
            raise ValueError(
                f"Multi-model chat requires exactly 3 LLM overrides, got {len(new_msg_req.llm_overrides)}"
            )
        # Multi-model chat: run 3 models in parallel
        if new_msg_req.deep_research:
            raise RuntimeError(
                "Deep research is not supported for multi-model chat"
            )
    model_names.append(
        llm_override.model_version
        or llm_override.model_provider
        or f"Model {len(llms)}"
    )
Model name fallback creates a confusing display. When both `model_version` and `model_provider` are None, this generates "Model 1", "Model 2", "Model 3". But line 761 uses `f"Model {model_names[i]}"`, which would produce "Model Model 1". Consider: `llm_override.model_version or llm_override.model_provider or f"Model {i + 1}"`
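A short sketch of the fix suggested above, using the loop index via `enumerate` so the numeric fallback no longer depends on `len(llms)`; only the relevant lines of the loop are shown.

```python
# Sketch of the suggested fallback: derive the generic name from the loop
# index so it stays stable regardless of when llms is appended to.
for i, llm_override in enumerate(new_msg_req.llm_overrides):
    ...
    model_names.append(
        llm_override.model_version
        or llm_override.model_provider
        or f"Model {i + 1}"
    )
```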
    # Create LLM instances for each model override
    llms: list[LLM] = []
    model_names: list[str] = []
    for llm_override in new_msg_req.llm_overrides:
        model_llm = get_llm_for_persona(
            persona=persona,
            user=user,
            llm_override=llm_override,
            additional_headers=litellm_additional_headers,
            long_term_logger=long_term_logger,
        )
        llms.append(model_llm)
        model_names.append(
            llm_override.model_version
            or llm_override.model_provider
            or f"Model {len(llms)}"
        )
Missing LLM cost limit checks for multi-model. The single-model path checks cost limits at line 512-518, but multi-model creates 3 LLM instances without checking if each model's API key is within cost limits. Should validate cost limits for each llm_override before creating the LLM instances
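One way to address this, sketched below with a hypothetical `enforce_llm_cost_limits` helper standing in for whatever check the single-model path performs, is to validate every override before any of the three LLM instances is created.

```python
# Sketch only: `enforce_llm_cost_limits` is a hypothetical stand-in for the
# existing single-model cost-limit check; the point is to run it once per
# override before building the three LLM instances.
for llm_override in new_msg_req.llm_overrides:
    enforce_llm_cost_limits(
        user=user,
        persona=persona,
        llm_override=llm_override,
    )
```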
    completed_normally = check_is_connected()
    for i, (assistant_response, state_container) in enumerate(
        zip(assistant_responses, state_containers)
    ):
        if completed_normally:
            if state_container.answer_tokens is None:
                final_answer = (
                    f"Model {model_names[i]} did not return an answer."
                )
            else:
                final_answer = state_container.answer_tokens
        else:
            if state_container.answer_tokens:
                final_answer = (
                    state_container.answer_tokens
                    + " ... The generation was stopped by the user here."
                )
            else:
                final_answer = "The generation was stopped by the user."

        # Build citation_docs_info from accumulated citations
        citation_docs_info: list[CitationDocInfo] = []
        seen_citation_nums: set[int] = set()
        for citation_num, search_doc in state_container.citation_to_doc.items():
            if citation_num not in seen_citation_nums:
                seen_citation_nums.add(citation_num)
                citation_docs_info.append(
                    CitationDocInfo(
                        search_doc=search_doc,
                        citation_number=citation_num,
                    )
                )

        save_chat_turn(
            message_text=final_answer,
            reasoning_tokens=state_container.reasoning_tokens,
            citation_docs_info=citation_docs_info,
            tool_calls=state_container.tool_calls,
            db_session=db_session,
            assistant_message=assistant_response,
            is_clarification=state_container.is_clarification,
        )
No validation that any model succeeded before saving. If all 3 models fail (exceptions caught in _run_multi_model_chat_loops), the code still tries to save all responses. Check if state_container.answer_tokens is None for all models and handle appropriately (either raise error or save partial results with clear error messages)
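A minimal guard along the lines of this comment, assuming `state_containers` is in scope just before the save loop; the error message wording is illustrative.

```python
# Sketch: if every model failed to produce an answer, surface an error
# instead of persisting three empty responses.
if all(sc.answer_tokens is None for sc in state_containers):
    raise RuntimeError(
        "All models failed to generate a response for this multi-model turn"
    )
```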
    # Start all model loops in parallel
    threads = []
    for i, (llm, state_container) in enumerate(zip(llms, state_containers)):
        thread = threading.Thread(target=run_model_loop, args=(i, llm, state_container))
Thread daemon mode not set - threads should be created with daemon=True to ensure they don't prevent the process from exiting if the main thread terminates unexpectedly
Suggested change:

    thread = threading.Thread(target=run_model_loop, args=(i, llm, state_container), daemon=True)
    finally:
        # Wait for all threads to complete
        for thread in threads:
            thread.join(timeout=5.0)
Thread cleanup timeout is too short - a 5-second timeout may not be enough for the LLM loops to clean up properly, especially if they're mid-generation. Consider increasing it to 10-30 seconds or making it configurable; threads that don't join will leak resources.
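A sketch of the configurable variant, assuming `threads` is the list built above; the environment variable name and the 30-second default are illustrative.

```python
# Sketch: read the join timeout from the environment instead of hard-coding
# 5 seconds. MULTI_MODEL_THREAD_JOIN_TIMEOUT is an illustrative variable name.
import logging
import os

logger = logging.getLogger(__name__)

THREAD_JOIN_TIMEOUT_SECONDS = float(
    os.environ.get("MULTI_MODEL_THREAD_JOIN_TIMEOUT", "30")
)

for thread in threads:
    thread.join(timeout=THREAD_JOIN_TIMEOUT_SECONDS)
    if thread.is_alive():
        logger.warning("Model thread did not finish within the join timeout")
```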
    # For multi-model chat: list of LLM overrides to compare (exactly 3 required)
    llm_overrides: list[LLMOverride] | None = None
Missing validation in model - add a validator to ensure that if llm_overrides is provided, it contains exactly 3 elements and llm_override is None (they should be mutually exclusive)
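A sketch of such a validator, assuming the project uses Pydantic v2 and that `LLMOverride` is already importable in this module; unrelated fields are omitted.

```python
# Sketch, assuming Pydantic v2: llm_overrides, when provided, must contain
# exactly 3 entries and must not be combined with llm_override.
# LLMOverride is assumed to be importable from the existing models module.
from pydantic import BaseModel, model_validator


class SendMessageRequest(BaseModel):
    llm_override: LLMOverride | None = None
    llm_overrides: list[LLMOverride] | None = None
    # ... other fields omitted ...

    @model_validator(mode="after")
    def _validate_llm_overrides(self) -> "SendMessageRequest":
        if self.llm_overrides is not None:
            if len(self.llm_overrides) != 3:
                raise ValueError("llm_overrides must contain exactly 3 entries")
            if self.llm_override is not None:
                raise ValueError(
                    "llm_override and llm_overrides are mutually exclusive"
                )
        return self
```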
Additional Comments (1)
Description
This branch adds multi-model chat functionality, allowing users to select up to 3 AI models and receive parallel responses from each. The responses are displayed in a side-by-side comparison view where users can select their preferred response.
How Has This Been Tested?
Manual testing locally
Additional Options
Summary by cubic
Adds multi-model chat end-to-end: run 2–3 models in parallel per message, stream packets tagged by model_index, persist per-model results, and render side-by-side responses in the UI.
New Features
Migration
Written for commit ec1f0a4. Summary will update on new commits.