Describe the bug
When using the "fast" option for citation_quality in the RAG flow with the Cohere API, Unicode decoding errors occur for characters that are split across multiple tokens in a streaming response. Multi-byte characters like ü, ö, ä are then displayed as the Unicode replacement character (�), even though the same characters are displayed correctly elsewhere in the response. The issue disappears when citation_quality is set to "accurate".
In the examples below, characters like ö, ü, and ä are split between messages, causing them to be replaced with \ufffd, which represents an invalid character or decoding error.
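To illustrate the failure mode described above, here is a minimal Python sketch. It assumes the replacement characters come from a UTF-8 byte sequence being cut at a chunk boundary and each chunk being decoded on its own; the issue does not confirm that this is what the API or SDK does internally. The chunk boundary and the helper names `decode_naive` and `decode_incremental` are hypothetical and exist only for this example.

```python
import codecs


def decode_naive(chunks):
    # Decode every chunk independently. A multi-byte character that is cut
    # at a chunk boundary cannot be decoded and becomes U+FFFD.
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)


def decode_incremental(chunks):
    # An incremental decoder keeps an incomplete trailing byte sequence in
    # its buffer until the next chunk supplies the remaining bytes.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # flush any leftover bytes
    return "".join(parts)


raw = "Körpertemperatur".encode("utf-8")
# Hypothetical boundary in the middle of the two-byte "ö" (0xC3 0xB6).
chunks = [raw[:2], raw[2:]]

naive = decode_naive(chunks)
fixed = decode_incremental(chunks)
print(repr(naive))
print(repr(fixed))
```

With per-chunk decoding, each half of the split "ö" turns into its own replacement character, while the incremental decoder reassembles the character before emitting it.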
Expected Behavior
Special characters should be handled and displayed correctly, even when split across tokens in a streaming response, regardless of the citation_quality setting.
Actual Behavior
When citation_quality is set to "fast", characters that are split across tokens (especially multi-byte characters like ü, ö, ä) are incorrectly displayed as the Unicode replacement character (� or \ufffd).
Screenshots N/A
Workaround
Setting citation_quality to "accurate" resolves the issue, but at the cost of performance.
We're also experiencing this issue, but only after using the Cohere model in the Azure marketplace.
Before that, we were using the model directly from Cohere. I remember we did have Unicode errors as described in the beginning, but the latest version was fine (in "fast" mode).
It's also worth noting that the citation start and end ranges are offset once the above bug is encountered.
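One possible explanation for the shifted citation ranges, sketched in Python: if each byte of a split two-byte character is turned into its own replacement character, the text gains one character per affected letter, so offsets computed against the correct text point one position too early for every corrupted character before them. This is an assumption for illustration only. The sample sentence, the `corrupt` helper, and the idea that offsets are counted in characters are not taken from the issue.

```python
def corrupt(text: str) -> str:
    # Hypothetical model of the bug: every multi-byte character is replaced
    # by one U+FFFD per UTF-8 byte, as if each byte were decoded alone.
    out = []
    for ch in text:
        n = len(ch.encode("utf-8"))
        out.append(ch if n == 1 else "\ufffd" * n)
    return "".join(out)


correct = "Im Frühsommer steigt die Körpertemperatur"
cited = "Körpertemperatur"

# Citation range computed against the correct text.
start = correct.index(cited)
end = start + len(cited)

corrupted = corrupt(correct)

print(repr(correct[start:end]))    # the intended span
print(repr(corrupted[start:end]))  # the same range applied to corrupted text
```

In this sketch the corrupted "ü" earlier in the sentence adds one character, so the same range starts one position too early and cuts off the end of the cited word.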
SDK Version (required)
5.9.1
Examples:
Incorrect encoding in citation_quality: "fast"
Lösungen und beschäftigen:
"Lösungen und beschäftigen"
Körpertemperatur:
"Körpertemperatur"
Frühsommer:
"Frühsommer"