fix(tokenizer): resolve TypeError #914
base: main
Conversation
…e content

- Introduced extractTextFromMessageContent function to handle message content that can be string or array of message parts.
- Updated chat.ts, calculate-prompt-tokens.ts, and estimate-tokens.ts to use this function for consistent content extraction before token encoding.
- This change improves compatibility with gpt-tokenizer, which expects string content.
- Added extractTextFromMessageContent implementation in types.ts for reuse.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
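To make the helper's effect concrete, here is a minimal sketch; the message values are hypothetical and the import path stands in for apps/gateway/src/chat/tools/types.ts:

```ts
import { extractTextFromMessageContent } from "./types";

// A message whose content arrives as an array of parts (illustrative values)
const content = [
  { type: "text", text: "Describe this image:" },
  { type: "image_url", image_url: { url: "https://example.com/cat.png" } },
];

// Only text parts survive and are joined with a space, so the tokenizer
// always receives a plain string instead of an array of parts.
extractTextFromMessageContent(content); // => "Describe this image:"
extractTextFromMessageContent("plain text"); // => "plain text"
```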
Walkthrough

Standardizes message content extraction across chat payload building and token estimation by introducing and using a new helper, extractTextFromMessageContent, to consistently derive string content from string or array-based message content. Imports are updated accordingly; no function signatures changed. Token calculation/estimation flows remain the same aside from the unified extraction step.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant C as Client
    participant G as Gateway (chat.ts)
    participant U as Utility (extractTextFromMessageContent)
    participant O as OpenAI-compatible API
    C->>G: Send messages[]
    loop Build payload
        G->>U: extractTextFromMessageContent(m.content)
        U-->>G: string content
    end
    G->>O: Create chat completion (non-stream/stream)
    O-->>G: Response / Stream chunks
    G-->>C: Response / Stream relay
    note over G,U: Unified extraction for string/array content
```

```mermaid
sequenceDiagram
    autonumber
    participant G as Gateway Tools
    participant U as Utility (extractTextFromMessageContent)
    participant T as Tokenizer
    G->>G: map messages -> ChatMessage
    G->>U: extractTextFromMessageContent(m.content)
    U-->>G: string content
    G->>T: encode ChatMessage[]
    T-->>G: token counts / throws
    alt encode error
        G->>G: fallback length-based estimate
    end
```
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs
Suggested labels
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (8)
apps/gateway/src/chat/tools/estimate-tokens.ts (2)
43-49: Fallback undercounts when `m.content` is an array
`.length` on an array counts parts, not characters. Use the extractor in the fallback too.

```diff
-      calculatedPromptTokens =
-        messages.reduce((acc, m) => acc + (m.content?.length || 0), 0) / 4;
+      calculatedPromptTokens =
+        messages.reduce(
+          (acc, m) => acc + extractTextFromMessageContent(m.content).length,
+          0,
+        ) / 4;
```
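A quick illustration of the undercount, with hypothetical content:

```ts
// Array-form content: .length counts parts, not characters.
const content = [
  { type: "text", text: "What is in this picture?" },
  { type: "image_url", image_url: { url: "https://example.com/img.png" } },
];

content.length;                                // 2  -> fallback estimates ~0 tokens
extractTextFromMessageContent(content).length; // 24 -> ~6 tokens under the length/4 heuristic
```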
55-55: Avoid JSON.stringify for pure strings before tokenizing
`content` is already a string; stringifying adds quotes/escapes and inflates counts.

```diff
-          calculatedCompletionTokens = encode(JSON.stringify(content)).length;
+          calculatedCompletionTokens = encode(content).length;
```
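To see why, compare the raw and stringified forms — a minimal sketch using gpt-tokenizer's `encode`; the sample string is hypothetical and exact counts depend on the tokenizer version:

```ts
import { encode } from "gpt-tokenizer";

const content = 'He said "hi"\nand left.';

// JSON.stringify wraps the text in quotes and escapes the inner quotes and the
// newline, so the encoded form is strictly longer than the raw text.
console.log(encode(content).length);                 // token count of the raw text
console.log(encode(JSON.stringify(content)).length); // inflated by added quotes/escapes
```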
apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (1)

22-30: Same fallback issue: count text, not array length

Use the extractor during the length/4 fallback to avoid undercounting with array content.
```diff
-      messages.reduce(
-        (acc: number, m: any) => acc + (m.content?.length || 0),
-        0,
-      ) / 4,
+      messages.reduce(
+        (acc: number, m: any) =>
+          acc + extractTextFromMessageContent(m.content).length,
+        0,
+      ) / 4,
```

apps/gateway/src/chat/chat.ts (5)
732-737: Fix length/4 fallback to account for array content

Prevent undercounting when `messages[*].content` is an array.

```diff
-    const messageTokens = messages.reduce(
-      (acc, m) => acc + (m.content?.length || 0),
-      0,
-    );
+    const messageTokens = messages.reduce(
+      (acc, m) => acc + extractTextFromMessageContent(m.content).length,
+      0,
+    );
```
2547-2550: Streaming fallback: same array-content undercount

Mirror the extractor here too.

```diff
-            calculatedPromptTokens =
-              messages.reduce((acc, m) => acc + (m.content?.length || 0), 0) /
-              4;
+            calculatedPromptTokens =
+              messages.reduce(
+                (acc, m) => acc + extractTextFromMessageContent(m.content).length,
+                0,
+              ) / 4;
```
2555-2558: Tokenizing completion: avoid JSON.stringify for strings

Use the raw `fullContent` to keep counts accurate.

```diff
-            calculatedCompletionTokens = encode(
-              JSON.stringify(fullContent),
-            ).length;
+            calculatedCompletionTokens = encode(fullContent).length;
```
2655-2658: Cost logging: normalize prompt text for arrays

Improve observability and parity with token math by using the extractor.

```diff
-        prompt: messages.map((m) => m.content).join("\n"),
+        prompt: messages
+          .map((m) => extractTextFromMessageContent(m.content))
+          .join("\n"),
```
3041-3044: Same normalization for non-streaming cost calculation

Keep both paths consistent.

```diff
-        prompt: messages.map((m) => m.content).join("\n"),
+        prompt: messages
+          .map((m) => extractTextFromMessageContent(m.content))
+          .join("\n"),
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- apps/gateway/src/chat/chat.ts (3 hunks)
- apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (2 hunks)
- apps/gateway/src/chat/tools/estimate-tokens.ts (2 hunks)
- apps/gateway/src/chat/tools/types.ts (1 hunk)
🧰 Additional context used
📓 Path-based instructions (4)
{apps/api,apps/gateway,apps/ui,apps/docs,packages}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Always use top-level import; never use require() or dynamic imports (e.g., import(), next/dynamic)
Files:
- apps/gateway/src/chat/tools/estimate-tokens.ts
- apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
- apps/gateway/src/chat/chat.ts
- apps/gateway/src/chat/tools/types.ts
{apps/api,apps/gateway,packages/db}/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
{apps/api,apps/gateway,packages/db}/**/*.ts: Use Drizzle ORM with the latest object syntax for database access
For reads, use db().query.<table>.findMany() or db().query.<table>.findFirst()

Files:
- apps/gateway/src/chat/tools/estimate-tokens.ts
- apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
- apps/gateway/src/chat/chat.ts
- apps/gateway/src/chat/tools/types.ts

**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.{ts,tsx}: Never use `any` or `as any` in this TypeScript project unless absolutely necessary
Always use top-level `import`; never use `require` or dynamic imports

Files:

- apps/gateway/src/chat/tools/estimate-tokens.ts
- apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
- apps/gateway/src/chat/chat.ts
- apps/gateway/src/chat/tools/types.ts

{apps/api,apps/gateway}/src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
For reads, use db().query.<table>.findMany() or db().query.<table>.findFirst()

Files:

- apps/gateway/src/chat/tools/estimate-tokens.ts
- apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
- apps/gateway/src/chat/chat.ts
- apps/gateway/src/chat/tools/types.ts

🧬 Code graph analysis (3)
apps/gateway/src/chat/tools/estimate-tokens.ts (1)
- apps/gateway/src/chat/tools/types.ts (1)
  - extractTextFromMessageContent (15-28)

apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (1)
- apps/gateway/src/chat/tools/types.ts (1)
  - extractTextFromMessageContent (15-28)

apps/gateway/src/chat/chat.ts (1)
- apps/gateway/src/chat/tools/types.ts (1)
  - extractTextFromMessageContent (15-28)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: e2e-shards (3)
- GitHub Check: e2e-shards (2)
- GitHub Check: e2e-shards (5)
- GitHub Check: e2e-shards (4)
- GitHub Check: e2e-shards (1)
- GitHub Check: build / run
- GitHub Check: lint / run
- GitHub Check: generate / run
- GitHub Check: test / run
- GitHub Check: autofix
🔇 Additional comments (5)
apps/gateway/src/chat/tools/estimate-tokens.ts (1)
33-36: Good swap to centralized extractor for encodeChat

Using `extractTextFromMessageContent` here removes the TypeError surface.

apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (1)
15-19: LGTM on extracting text for encodeChat

Centralization reduces edge cases and aligns with the PR goal.
apps/gateway/src/chat/chat.ts (3)
57-61: LGTM on importing the new extractor

Consistent usage across the file is the right move.
721-726: LGTM on prompt token estimation input

Passing `extractTextFromMessageContent(m.content)` into `encodeChat` fixes the original crash vector.
2532-2536: LGTM: extractor used for streaming encodeChat

Good parity with non-streaming path.
```ts
export function extractTextFromMessageContent(content: string | any[]): string {
  if (typeof content === "string") {
    return content;
  }

  if (Array.isArray(content)) {
    return content
      .filter((part: any) => part.type === "text")
      .map((part: any) => part.text || "")
      .join(" ");
  }

  return "";
}
```
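For reference, a quick behavioral sketch of the helper as quoted above (inputs are illustrative):

```ts
extractTextFromMessageContent("hello");  // => "hello"  (strings pass through)
extractTextFromMessageContent([
  { type: "text", text: "a" },
  { type: "image_url", image_url: { url: "https://example.com/x.png" } },
  { type: "text", text: "b" },
]);                                      // => "a b"    (non-text parts dropped)
extractTextFromMessageContent([]);       // => ""       (no text parts to join)
```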
Replace `any` with precise content-part types; tighten signature and add safe narrowing

Project guideline forbids `any`. Define a typed union for message parts and use a type guard. Also handle null/undefined and normalize spacing.
```diff
 export const DEFAULT_TOKENIZER_MODEL = "gpt-4";

 // Define ChatMessage type to match what gpt-tokenizer expects
 export interface ChatMessage {
   role: "user" | "system" | "assistant" | undefined;
   content: string;
   name?: string;
 }
+// Message content parts (mirror zod schema in chat.ts)
+export interface TextPart {
+  type: "text";
+  text: string;
+}
+export interface ImageUrlPart {
+  type: "image_url";
+  image_url: { url: string; detail?: "low" | "high" | "auto" };
+}
+export type MessageContentPart = TextPart | ImageUrlPart;
+
 /**
  * Extracts text content from a message content field, handling both string and array formats
  * This function is necessary because BaseMessage.content can be string | MessageContent[]
  * but gpt-tokenizer expects only strings
  */
-export function extractTextFromMessageContent(content: string | any[]): string {
-  if (typeof content === "string") {
-    return content;
-  }
-
-  if (Array.isArray(content)) {
-    return content
-      .filter((part: any) => part.type === "text")
-      .map((part: any) => part.text || "")
-      .join(" ");
-  }
-
-  return "";
-}
+export function extractTextFromMessageContent(
+  content: string | MessageContentPart[] | null | undefined,
+): string {
+  if (typeof content === "string") return content;
+  if (Array.isArray(content)) {
+    return content
+      .filter((part): part is TextPart => part.type === "text" && typeof (part as TextPart).text === "string")
+      .map((part) => part.text)
+      .join(" ")
+      .replace(/\s+/g, " ")
+      .trim();
+  }
+  return "";
+}
```
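With this suggested type-guard version, mixed parts and ragged spacing normalize cleanly; the input below is illustrative:

```ts
extractTextFromMessageContent([
  { type: "text", text: "  hello " },
  { type: "image_url", image_url: { url: "https://example.com/a.png" } },
  { type: "text", text: " world  " },
]);
// => "hello world"  (image part filtered out, whitespace collapsed and trimmed)

extractTextFromMessageContent(null); // => "" (null/undefined now admitted by the signature)
```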
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ts
export interface TextPart {
  type: "text";
  text: string;
}
export interface ImageUrlPart {
  type: "image_url";
  image_url: { url: string; detail?: "low" | "high" | "auto" };
}
export type MessageContentPart = TextPart | ImageUrlPart;

export function extractTextFromMessageContent(
  content: string | MessageContentPart[] | null | undefined,
): string {
  if (typeof content === "string") return content;
  if (Array.isArray(content)) {
    return content
      .filter((part): part is TextPart => part.type === "text" && typeof (part as TextPart).text === "string")
      .map((part) => part.text)
      .join(" ")
      .replace(/\s+/g, " ")
      .trim();
  }
  return "";
}
```
🤖 Prompt for AI Agents
```text
In apps/gateway/src/chat/tools/types.ts around lines 15 to 28, replace the use
of `any` and the loose signature by defining a MessagePart union type (e.g.,
TextPart { type: "text"; text?: string | null }, ImagePart | OtherPart as
needed), change the function signature to accept string | MessagePart[] | null
| undefined, add a type guard isTextPart(part): part is TextPart to narrow
safely, filter out null/undefined and only text parts, map to (part.text ?? "")
and normalize whitespace by trimming parts and joining with a single space, and
return an empty string for other inputs; ensure no `any` remains and the types
are exported if reused.
```
Summary
- Adds the `extractTextFromMessageContent` utility to handle string and array content types

Changes
Core Functionality
- Added `extractTextFromMessageContent` function in `types.ts` to extract plain text from message content that can be a string or an array
- Updated to use `extractTextFromMessageContent` in:
  - `chat.ts` (message mapping for encoding and token calculation)
  - `calculate-prompt-tokens.ts` (prompt token calculation)
  - `estimate-tokens.ts` (token estimation)

Bug Fix
- Resolves the `TypeError` by ensuring all message contents passed to `gpt-tokenizer` are strings

Test plan
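The test plan was left empty in the description; as a starting point, a unit-test sketch for the new helper might look like this (vitest and the relative import path are assumptions, not the repo's confirmed setup):

```ts
import { describe, expect, it } from "vitest";
import { extractTextFromMessageContent } from "./types";

describe("extractTextFromMessageContent", () => {
  it("passes plain strings through unchanged", () => {
    expect(extractTextFromMessageContent("hello")).toBe("hello");
  });

  it("joins text parts and ignores non-text parts", () => {
    expect(
      extractTextFromMessageContent([
        { type: "text", text: "a" },
        { type: "image_url", image_url: { url: "https://example.com/x.png" } },
        { type: "text", text: "b" },
      ]),
    ).toBe("a b");
  });

  it("returns an empty string when no text can be extracted", () => {
    expect(extractTextFromMessageContent([])).toBe("");
  });
});
```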
🌿 Generated by Terry
ℹ️ Tag @terragon-labs to ask questions and address PR feedback
📎 Task: https://www.terragonlabs.com/task/113d7469-8d36-49e1-850f-d5e9c39e6e77
Summary by CodeRabbit