fix(tokenizer): resolve TypeError #914
base: main
Conversation
…e content

- Introduced extractTextFromMessageContent function to handle message content that can be string or array of message parts.
- Updated chat.ts, calculate-prompt-tokens.ts, and estimate-tokens.ts to use this function for consistent content extraction before token encoding.
- This change improves compatibility with gpt-tokenizer, which expects string content.
- Added extractTextFromMessageContent implementation in types.ts for reuse.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
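To make the helper's effect concrete, here is a minimal sketch; the message values are hypothetical and the import path stands in for apps/gateway/src/chat/tools/types.ts:

```ts
import { extractTextFromMessageContent } from "./types";

// A message whose content arrives as an array of parts (illustrative values)
const content = [
  { type: "text", text: "Describe this image:" },
  { type: "image_url", image_url: { url: "https://example.com/cat.png" } },
];

// Only text parts survive and are joined with a space, so the tokenizer
// always receives a plain string instead of an array of parts.
extractTextFromMessageContent(content); // => "Describe this image:"
extractTextFromMessageContent("plain text"); // => "plain text"
```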
Walkthrough

Standardizes message content extraction across chat payload building and token estimation by introducing and using a new helper, extractTextFromMessageContent, to consistently derive string content from string or array-based message content. Imports are updated accordingly; no function signatures changed. Token calculation/estimation flows remain the same aside from the unified extraction step.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant C as Client
    participant G as Gateway (chat.ts)
    participant U as Utility (extractTextFromMessageContent)
    participant O as OpenAI-compatible API
    C->>G: Send messages[]
    loop Build payload
        G->>U: extractTextFromMessageContent(m.content)
        U-->>G: string content
    end
    G->>O: Create chat completion (non-stream/stream)
    O-->>G: Response / Stream chunks
    G-->>C: Response / Stream relay
    note over G,U: Unified extraction for string/array content
```

```mermaid
sequenceDiagram
    autonumber
    participant G as Gateway Tools
    participant U as Utility (extractTextFromMessageContent)
    participant T as Tokenizer
    G->>G: map messages -> ChatMessage
    G->>U: extractTextFromMessageContent(m.content)
    U-->>G: string content
    G->>T: encode ChatMessage[]
    T-->>G: token counts / throws
    alt encode error
        G->>G: fallback length-based estimate
    end
```
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs
Suggested labels
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (8)
apps/gateway/src/chat/tools/estimate-tokens.ts (2)
43-49: Fallback undercounts when `m.content` is an array
`.length` on an array counts parts, not characters. Use the extractor in the fallback too.

```diff
-      calculatedPromptTokens =
-        messages.reduce((acc, m) => acc + (m.content?.length || 0), 0) / 4;
+      calculatedPromptTokens =
+        messages.reduce(
+          (acc, m) => acc + extractTextFromMessageContent(m.content).length,
+          0,
+        ) / 4;
```
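A quick illustration of the undercount, with hypothetical content:

```ts
// Array-form content: .length counts parts, not characters.
const content = [
  { type: "text", text: "What is in this picture?" },
  { type: "image_url", image_url: { url: "https://example.com/img.png" } },
];

content.length;                                // 2  -> fallback estimates ~0 tokens
extractTextFromMessageContent(content).length; // 24 -> ~6 tokens under the length/4 heuristic
```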
55-55: Avoid JSON.stringify for pure strings before tokenizing
`content` is already a string; stringifying adds quotes/escapes and inflates counts.

```diff
-          calculatedCompletionTokens = encode(JSON.stringify(content)).length;
+          calculatedCompletionTokens = encode(content).length;
```
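To see why, compare the raw and stringified forms — a minimal sketch using gpt-tokenizer's `encode`; the sample string is hypothetical and exact counts depend on the tokenizer version:

```ts
import { encode } from "gpt-tokenizer";

const content = 'He said "hi"\nand left.';

// JSON.stringify wraps the text in quotes and escapes the inner quotes and the
// newline, so the encoded form is strictly longer than the raw text.
console.log(encode(content).length);                 // token count of the raw text
console.log(encode(JSON.stringify(content)).length); // inflated by added quotes/escapes
```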
apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (1)

22-30: Same fallback issue: count text, not array length

Use the extractor during the length/4 fallback to avoid undercounting with array content.
```diff
-      messages.reduce(
-        (acc: number, m: any) => acc + (m.content?.length || 0),
-        0,
-      ) / 4,
+      messages.reduce(
+        (acc: number, m: any) =>
+          acc + extractTextFromMessageContent(m.content).length,
+        0,
+      ) / 4,
```

apps/gateway/src/chat/chat.ts (5)
732-737: Fix length/4 fallback to account for array content

Prevent undercounting when `messages[*].content` is an array.

```diff
-    const messageTokens = messages.reduce(
-      (acc, m) => acc + (m.content?.length || 0),
-      0,
-    );
+    const messageTokens = messages.reduce(
+      (acc, m) => acc + extractTextFromMessageContent(m.content).length,
+      0,
+    );
```
2547-2550: Streaming fallback: same array-content undercount

Mirror the extractor here too.

```diff
-            calculatedPromptTokens =
-              messages.reduce((acc, m) => acc + (m.content?.length || 0), 0) /
-              4;
+            calculatedPromptTokens =
+              messages.reduce(
+                (acc, m) => acc + extractTextFromMessageContent(m.content).length,
+                0,
+              ) / 4;
```
2555-2558: Tokenizing completion: avoid JSON.stringify for strings

Use the raw `fullContent` to keep counts accurate.

```diff
-            calculatedCompletionTokens = encode(
-              JSON.stringify(fullContent),
-            ).length;
+            calculatedCompletionTokens = encode(fullContent).length;
```
2655-2658: Cost logging: normalize prompt text for arrays

Improve observability and parity with token math by using the extractor.

```diff
-        prompt: messages.map((m) => m.content).join("\n"),
+        prompt: messages
+          .map((m) => extractTextFromMessageContent(m.content))
+          .join("\n"),
```
3041-3044: Same normalization for non-streaming cost calculation

Keep both paths consistent.

```diff
-        prompt: messages.map((m) => m.content).join("\n"),
+        prompt: messages
+          .map((m) => extractTextFromMessageContent(m.content))
+          .join("\n"),
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- apps/gateway/src/chat/chat.ts (3 hunks)
- apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (2 hunks)
- apps/gateway/src/chat/tools/estimate-tokens.ts (2 hunks)
- apps/gateway/src/chat/tools/types.ts (1 hunk)
🧰 Additional context used
📓 Path-based instructions (4)
{apps/api,apps/gateway,apps/ui,apps/docs,packages}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Always use top-level import; never use require() or dynamic imports (e.g., import(), next/dynamic)
Files:
- apps/gateway/src/chat/tools/estimate-tokens.ts
- apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
- apps/gateway/src/chat/chat.ts
- apps/gateway/src/chat/tools/types.ts
{apps/api,apps/gateway,packages/db}/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
{apps/api,apps/gateway,packages/db}/**/*.ts: Use Drizzle ORM with the latest object syntax for database access
For reads, use db().query.<table>.findMany() or db().query.<table>.findFirst()

Files:
- apps/gateway/src/chat/tools/estimate-tokens.ts
- apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
- apps/gateway/src/chat/chat.ts
- apps/gateway/src/chat/tools/types.ts

**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{ts,tsx}: Never use `any` or `as any` in this TypeScript project unless absolutely necessary
Always use top-level `import`; never use `require` or dynamic imports

Files:

- apps/gateway/src/chat/tools/estimate-tokens.ts
- apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
- apps/gateway/src/chat/chat.ts
- apps/gateway/src/chat/tools/types.ts

{apps/api,apps/gateway}/src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
For reads, use db().query.<table>.findMany() or db().query.<table>.findFirst()

Files:

- apps/gateway/src/chat/tools/estimate-tokens.ts
- apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
- apps/gateway/src/chat/chat.ts
- apps/gateway/src/chat/tools/types.ts

🧬 Code graph analysis (3)
apps/gateway/src/chat/tools/estimate-tokens.ts (1)
- apps/gateway/src/chat/tools/types.ts (1)
  - extractTextFromMessageContent (15-28)

apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (1)
- apps/gateway/src/chat/tools/types.ts (1)
  - extractTextFromMessageContent (15-28)

apps/gateway/src/chat/chat.ts (1)
- apps/gateway/src/chat/tools/types.ts (1)
  - extractTextFromMessageContent (15-28)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: e2e-shards (3)
- GitHub Check: e2e-shards (2)
- GitHub Check: e2e-shards (5)
- GitHub Check: e2e-shards (4)
- GitHub Check: e2e-shards (1)
- GitHub Check: build / run
- GitHub Check: lint / run
- GitHub Check: generate / run
- GitHub Check: test / run
- GitHub Check: autofix
🔇 Additional comments (5)
apps/gateway/src/chat/tools/estimate-tokens.ts (1)
33-36: Good swap to centralized extractor for encodeChat

Using `extractTextFromMessageContent` here removes the TypeError surface.

apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (1)
15-19: LGTM on extracting text for encodeChat

Centralization reduces edge cases and aligns with the PR goal.
apps/gateway/src/chat/chat.ts (3)
57-61: LGTM on importing the new extractor

Consistent usage across the file is the right move.
721-726: LGTM on prompt token estimation input

Passing `extractTextFromMessageContent(m.content)` into `encodeChat` fixes the original crash vector.
2532-2536: LGTM: extractor used for streaming encodeChat

Good parity with non-streaming path.
```ts
export function extractTextFromMessageContent(content: string | any[]): string {
  if (typeof content === "string") {
    return content;
  }

  if (Array.isArray(content)) {
    return content
      .filter((part: any) => part.type === "text")
      .map((part: any) => part.text || "")
      .join(" ");
  }

  return "";
}
```
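For reference, a quick behavioral sketch of the helper as quoted above (inputs are illustrative):

```ts
extractTextFromMessageContent("hello");  // => "hello"  (strings pass through)
extractTextFromMessageContent([
  { type: "text", text: "a" },
  { type: "image_url", image_url: { url: "https://example.com/x.png" } },
  { type: "text", text: "b" },
]);                                      // => "a b"    (non-text parts dropped)
extractTextFromMessageContent([]);       // => ""       (no text parts to join)
```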
Replace `any` with precise content-part types; tighten signature and add safe narrowing

Project guideline forbids `any`. Define a typed union for message parts and use a type guard. Also handle null/undefined and normalize spacing.
```diff
 export const DEFAULT_TOKENIZER_MODEL = "gpt-4";

 // Define ChatMessage type to match what gpt-tokenizer expects
 export interface ChatMessage {
   role: "user" | "system" | "assistant" | undefined;
   content: string;
   name?: string;
 }
+// Message content parts (mirror zod schema in chat.ts)
+export interface TextPart {
+  type: "text";
+  text: string;
+}
+export interface ImageUrlPart {
+  type: "image_url";
+  image_url: { url: string; detail?: "low" | "high" | "auto" };
+}
+export type MessageContentPart = TextPart | ImageUrlPart;
+
 /**
  * Extracts text content from a message content field, handling both string and array formats
  * This function is necessary because BaseMessage.content can be string | MessageContent[]
  * but gpt-tokenizer expects only strings
  */
-export function extractTextFromMessageContent(content: string | any[]): string {
-  if (typeof content === "string") {
-    return content;
-  }
-
-  if (Array.isArray(content)) {
-    return content
-      .filter((part: any) => part.type === "text")
-      .map((part: any) => part.text || "")
-      .join(" ");
-  }
-
-  return "";
-}
+export function extractTextFromMessageContent(
+  content: string | MessageContentPart[] | null | undefined,
+): string {
+  if (typeof content === "string") return content;
+  if (Array.isArray(content)) {
+    return content
+      .filter((part): part is TextPart => part.type === "text" && typeof (part as TextPart).text === "string")
+      .map((part) => part.text)
+      .join(" ")
+      .replace(/\s+/g, " ")
+      .trim();
+  }
+  return "";
+}
```
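With this suggested type-guard version, mixed parts and ragged spacing normalize cleanly; the input below is illustrative:

```ts
extractTextFromMessageContent([
  { type: "text", text: "  hello " },
  { type: "image_url", image_url: { url: "https://example.com/a.png" } },
  { type: "text", text: " world  " },
]);
// => "hello world"  (image part filtered out, whitespace collapsed and trimmed)

extractTextFromMessageContent(null); // => "" (null/undefined now admitted by the signature)
```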
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
export interface TextPart {
  type: "text";
  text: string;
}
export interface ImageUrlPart {
  type: "image_url";
  image_url: { url: string; detail?: "low" | "high" | "auto" };
}
export type MessageContentPart = TextPart | ImageUrlPart;

export function extractTextFromMessageContent(
  content: string | MessageContentPart[] | null | undefined,
): string {
  if (typeof content === "string") return content;
  if (Array.isArray(content)) {
    return content
      .filter((part): part is TextPart => part.type === "text" && typeof (part as TextPart).text === "string")
      .map((part) => part.text)
      .join(" ")
      .replace(/\s+/g, " ")
      .trim();
  }
  return "";
}
```
🤖 Prompt for AI Agents
```text
In apps/gateway/src/chat/tools/types.ts around lines 15 to 28, replace the use
of `any` and the loose signature by defining a MessagePart union type (e.g.,
TextPart { type: "text"; text?: string | null }, ImagePart | OtherPart as
needed), change the function signature to accept string | MessagePart[] | null
| undefined, add a type guard isTextPart(part): part is TextPart to narrow
safely, filter out null/undefined and only text parts, map to (part.text ?? "")
and normalize whitespace by trimming parts and joining with a single space, and
return an empty string for other inputs; ensure no `any` remains and the types
are exported if reused.
```
Summary
- Adds the `extractTextFromMessageContent` utility to handle string and array content types

Changes
Core Functionality
- Added `extractTextFromMessageContent` function in `types.ts` to extract plain text from message content that can be a string or an array
- Updated to use `extractTextFromMessageContent` in:
  - `chat.ts` (message mapping for encoding and token calculation)
  - `calculate-prompt-tokens.ts` (prompt token calculation)
  - `estimate-tokens.ts` (token estimation)

Bug Fix
- Resolves the `TypeError` by ensuring all message contents passed to `gpt-tokenizer` are strings

Test plan
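The test plan was left empty in the description; as a starting point, a unit-test sketch for the new helper might look like this (vitest and the relative import path are assumptions, not the repo's confirmed setup):

```ts
import { describe, expect, it } from "vitest";
import { extractTextFromMessageContent } from "./types";

describe("extractTextFromMessageContent", () => {
  it("passes plain strings through unchanged", () => {
    expect(extractTextFromMessageContent("hello")).toBe("hello");
  });

  it("joins text parts and ignores non-text parts", () => {
    expect(
      extractTextFromMessageContent([
        { type: "text", text: "a" },
        { type: "image_url", image_url: { url: "https://example.com/x.png" } },
        { type: "text", text: "b" },
      ]),
    ).toBe("a b");
  });

  it("returns an empty string when no text can be extracted", () => {
    expect(extractTextFromMessageContent([])).toBe("");
  });
});
```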
🌿 Generated by Terry
ℹ️ Tag @terragon-labs to ask questions and address PR feedback
📎 Task: https://www.terragonlabs.com/task/113d7469-8d36-49e1-850f-d5e9c39e6e77
Summary by CodeRabbit