Simplify chat message streaming in chat template #6120

Open · wants to merge 2 commits into main from mbuck/simplify-chat-streaming

Conversation

@MackinnonBuck (Member) commented Mar 15, 2025

This PR is a proposal for improved handling of chat message streaming in the chat template.

Recently, the IChatClient interface was updated to discourage chat client implementations from manipulating the provided chat history. This created two problems for the chat template:

  1. The chat template was relying on messages containing FunctionInvocationContent being automatically added to the chat history by FunctionInvokingChatClient. This no longer happens, so the chat template now has to manually augment its chat history with those messages while reading the streaming response. Discussion here.
  2. However, chat clients may not expect the caller to manipulate the message list while producing a streaming response. This requires the chat template to clone the message list before passing it to the chat client. Discussion here.

Problem 2 goes away if the chat template stores in-progress messages separately from the "committed" chat history. It just renders two message lists in the UI: the in-progress list and the committed list.

Problem 1 then goes away if you change the in-progress list to store ChatResponseUpdates directly rather than mapping them to ChatMessages as they arrive. When streaming ends, the in-progress updates are converted into chat messages and appended to the committed history. Separate UI logic can decide how to display content from ChatResponseUpdates. This is the approach that this PR demonstrates, sketched below.
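
To make the idea concrete, here is a minimal sketch (not the PR's literal code) of the two-list approach inside a Blazor component, assuming the Microsoft.Extensions.AI streaming surface of this era (GetStreamingResponseAsync, ChatResponseUpdate, and the ToChatResponse() helper for folding updates back into messages); the field and method names, and whether ToChatResponse() exposes a Messages list in this library version, are assumptions:

```csharp
// Minimal sketch of the two-list approach (illustrative names, not the
// template's actual code), assuming Microsoft.Extensions.AI's
// GetStreamingResponseAsync / ChatResponseUpdate / ToChatResponse() APIs.
private readonly List<ChatMessage> messages = [];                       // committed history
private readonly List<ChatResponseUpdate> currentResponseUpdates = [];  // in-progress

private async Task SendMessageAsync(IChatClient chatClient, string userText)
{
    messages.Add(new ChatMessage(ChatRole.User, userText));
    currentResponseUpdates.Clear();

    // The committed list is never mutated while streaming, so it can be
    // passed to the chat client without cloning it first (problem 2).
    await foreach (var update in chatClient.GetStreamingResponseAsync(messages))
    {
        currentResponseUpdates.Add(update); // rendered by separate UI logic
        StateHasChanged();
    }

    // On completion, fold the accumulated updates (including any
    // function-invocation content) into chat messages and commit them
    // to the history (problem 1). ChatResponse.Messages is assumed here.
    messages.AddRange(currentResponseUpdates.ToChatResponse().Messages);
    currentResponseUpdates.Clear();
}
```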


@MackinnonBuck self-assigned this Mar 15, 2025
@MackinnonBuck requested a review from a team as a code owner on March 15, 2025 at 02:03
@github-actions (bot) added the area-ai-templates (Microsoft.Extensions.AI.Templates) label Mar 15, 2025
@SteveSandersonMS (Member) commented Mar 17, 2025

Thanks @MackinnonBuck!

This is certainly an improvement to clarity in some respects; however, if I'm understanding it correctly, it also appears to come at a significant perf cost. Perhaps it's possible to resolve that while retaining some of the clarity improvements?

The previous approach used an admittedly unusual pub-sub mechanism so that, for each streaming chunk, only the single ChatMessageItem displaying that chunk would re-render. Everything else was left alone. But in the approach in this PR, every incoming chunk causes the whole existing conversation to re-render in full, producing an O(N^2) effect on the number of renders.

I experimented by starting with the message "what's in the kit" and then clicking the first suggestion until I produced a conversation with 10 user messages. I think this is a realistic length of conversation, producing 1000-2000 streaming chunks.

  • With the previous approach, ChatMessageItem rendered 1284 times
  • With the proposed approach, ChatMessageItem rendered 25,696 times (and this grows quadratically as the conversation grows)

Maybe it would be possible to get some of the benefits of the new approach without the drawbacks: inline ChatAssistantContentItem back into ChatMessageItem (so that there's no need to differentiate the Content and Text parameters there) and restore the pub-sub signalling from Chat to ChatMessageItem, while keeping the new mechanism of tracking currentResponseUpdates separately from the committed history.
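
For reference, the pub-sub signalling could look roughly like this (a hypothetical sketch; the type and member names are assumptions, not the template's actual code). The key property is that only the one subscribing component re-renders per chunk:

```csharp
using Microsoft.AspNetCore.Components;

// Hypothetical pub-sub plumbing: Chat publishes one notification per chunk,
// and only the ChatMessageItem displaying the in-progress message re-renders.
public sealed class MessageSubscription
{
    public event Action? Updated;
    public void Publish() => Updated?.Invoke();
}

public class ChatMessageItemBase : ComponentBase, IDisposable
{
    [Parameter] public MessageSubscription? Subscription { get; set; }

    protected override void OnInitialized()
    {
        if (Subscription is not null)
        {
            Subscription.Updated += OnUpdated; // subscribe this one component
        }
    }

    // Re-render just this component; the rest of the conversation is untouched.
    private void OnUpdated() => _ = InvokeAsync(StateHasChanged);

    public void Dispose()
    {
        if (Subscription is not null)
        {
            Subscription.Updated -= OnUpdated;
        }
    }
}
```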

@SteveSandersonMS (Member) commented Mar 17, 2025

To be honest, even the previous approach's 1284 renders (for a 10-user-message conversation) is pushing at the boundaries, given that ChatMessageItem isn't completely trivial to render and diff. We may want to at least think through what approach we'd use if we were determined to minimize this expense. For example:

  • We could make the <assistant-message> part an independent component that acts as the subscriber in the pub-sub system, so we literally only re-render that single element per chunk. The rendering and diffing overhead then drops to roughly zero, and the whole cost is more or less the same as sending a websocket message from server to client on each chunk.
  • Or, we could bypass Blazor rendering entirely on a per-chunk basis and use JS interop to trigger an update to the text inside the <assistant-message> (roughly as sketched below). I think eShopSupport does that, or did at some point.
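
A rough sketch of the JS-interop option, reusing the fields from the earlier sketch; every name here is an assumption, including the JS function, which would need to be registered separately on the page:

```csharp
using Microsoft.JSInterop;

// Hypothetical JS-interop streaming: per-chunk updates bypass Blazor's
// renderer and append text straight to the <assistant-message> element.
// The JS side would register something like:
//   window.appendAssistantText = (id, text) =>
//       document.getElementById(id)?.append(text);
private async Task StreamViaJsInteropAsync(
    IJSRuntime js, IChatClient chatClient, string assistantElementId)
{
    await foreach (var update in chatClient.GetStreamingResponseAsync(messages))
    {
        currentResponseUpdates.Add(update);

        // No Blazor render or diff per chunk: push just the new text to the DOM.
        await js.InvokeVoidAsync("appendAssistantText", assistantElementId, update.Text);
    }
}
```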

Of course, this has to be balanced against the need for this template to work for newcomers, and hence limit the amount of code and sophistication in the rendering mechanism. Not saying a more general redesign should happen in this PR, just tracking thoughts about it.

@MackinnonBuck (Member, Author) commented:

Thanks for the insight, @SteveSandersonMS!

> This is certainly an improvement to clarity in some respects; however, if I'm understanding it correctly, it also appears to come at a significant perf cost. Perhaps it's possible to resolve that while retaining some of the clarity improvements?
>
> The previous approach used an admittedly unusual pub-sub mechanism so that, for each streaming chunk, only the single ChatMessageItem displaying that chunk would re-render. Everything else was left alone. But in the approach in this PR, every incoming chunk causes the whole existing conversation to re-render in full, producing an O(N^2) effect on the number of renders.

Yeah, this was something that crossed my mind when making these changes, but it was getting a little late by the time I put out the PR, and I didn't get around to thinking about it further. Perhaps I should have called that out. I appreciate that you took some measurements to quantify the perf difference, and I agree this should be addressed before we take this change (if we decide to do so).

The JS interop approach sounds interesting - I might try out something like that, and if it turns out to be too sophisticated for template code, I'll add back the pub/sub mechanism that existed previously. However, I probably won't get to that for a couple days while addressing other high-priority items.

@MackinnonBuck force-pushed the mbuck/simplify-chat-streaming branch from 8929016 to f71526d on March 17, 2025 at 20:28