Conversation

@steebchen (Member) commented Sep 22, 2025

Summary

  • Resolves TypeError caused by non-string message content during token encoding
  • Introduces extractTextFromMessageContent utility to handle string and array content types
  • Updates chat message processing to use this utility for consistent string content

Changes

Core Functionality

  • Added extractTextFromMessageContent function in types.ts to extract plain text from message content that can be a string or an array
  • Replaced direct content usage with extractTextFromMessageContent in:
    • chat.ts (message mapping for encoding and token calculation)
    • calculate-prompt-tokens.ts (prompt token calculation)
    • estimate-tokens.ts (token estimation)

Bug Fix

  • Prevents TypeError by ensuring all message contents passed to gpt-tokenizer are strings
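
A minimal sketch of the call-site pattern, for illustration only (the message shape, import path, and countPromptTokens helper are assumptions; the actual call sites are in chat.ts and the token tools):

import { encodeChat } from "gpt-tokenizer";
import { extractTextFromMessageContent } from "./tools/types";

// Assumed incoming shape: content may be a plain string or an array of parts.
type IncomingMessage = {
	role: "user" | "system" | "assistant";
	content: string | { type: string; text?: string }[];
};

function countPromptTokens(messages: IncomingMessage[]): number {
	const chat = messages.map((m) => ({
		role: m.role,
		// Before this fix, m.content was passed through as-is; an array here
		// made gpt-tokenizer throw a TypeError during encoding.
		content: extractTextFromMessageContent(m.content),
	}));
	return encodeChat(chat, "gpt-4").length;
}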

Test plan

  • Verify no TypeError occurs when message content is an array
  • Confirm token counts are correctly calculated for various message content formats
  • Test chat completions and streaming with mixed content types
  • Ensure backward compatibility with string-only message content
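
A sketch of what such tests might look like (the test runner and import path are assumptions; expected values follow the helper's documented behavior):

import { describe, expect, it } from "vitest";
import { extractTextFromMessageContent } from "../src/chat/tools/types";

describe("extractTextFromMessageContent", () => {
	it("passes string content through unchanged", () => {
		expect(extractTextFromMessageContent("hello")).toBe("hello");
	});

	it("joins text parts and skips non-text parts", () => {
		const parts = [
			{ type: "text", text: "hello" },
			{ type: "image_url", image_url: { url: "https://example.com/a.png" } },
			{ type: "text", text: "world" },
		];
		expect(extractTextFromMessageContent(parts)).toBe("hello world");
	});

	it("returns an empty string when no text parts are present", () => {
		expect(extractTextFromMessageContent([])).toBe("");
	});
});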

🌿 Generated by Terry


ℹ️ Tag @terragon-labs to ask questions and address PR feedback

📎 Task: https://www.terragonlabs.com/task/113d7469-8d36-49e1-850f-d5e9c39e6e77

Summary by CodeRabbit

  • New Features
    • Consistent text extraction from mixed/structured message content for all chat flows.
  • Improvements
    • More accurate and uniform token estimation by standardizing content processing.
    • Aligned behavior between initial and streaming responses for message content handling.
  • Bug Fixes
    • Resolved inconsistencies and errors when messages include non-string or structured content.
    • Reduced edge-case failures in chat payload generation and token counting, improving overall stability.

Fix TypeError in token encoding by extracting text from message content

- Introduced extractTextFromMessageContent function to handle message content
  that can be string or array of message parts.
- Updated chat.ts, calculate-prompt-tokens.ts, and estimate-tokens.ts to use this
  function for consistent content extraction before token encoding.
- This change improves compatibility with gpt-tokenizer which expects string content.
- Added extractTextFromMessageContent implementation in types.ts for reuse.

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
bunnyshell bot commented Sep 22, 2025

❌ Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • 🚀 /bns:deploy to deploy the environment

coderabbitai bot (Contributor) commented Sep 22, 2025

Walkthrough

Standardizes message content extraction across chat payload building and token estimation by introducing and using a new helper, extractTextFromMessageContent, to consistently derive string content from string or array-based message content. Imports are updated accordingly; no function signatures changed. Token calculation/estimation flows remain the same aside from the unified extraction step.

Changes

Cohort / File(s) and Summary:

  • Standardized content extraction usage (apps/gateway/src/chat/chat.ts, apps/gateway/src/chat/tools/calculate-prompt-tokens.ts, apps/gateway/src/chat/tools/estimate-tokens.ts): Replaced inline string/JSON serialization of message.content with extractTextFromMessageContent(m.content) for building ChatMessage payloads in normal and streaming paths, and for token calculation/estimation. Added corresponding imports.
  • New helper utility (apps/gateway/src/chat/tools/types.ts): Added exported extractTextFromMessageContent(content) that returns a string: direct string passthrough, join of "text" parts from array content, or empty string otherwise.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant G as Gateway (chat.ts)
  participant U as Utility (extractTextFromMessageContent)
  participant O as OpenAI-compatible API

  C->>G: Send messages[]
  loop Build payload
    G->>U: extractTextFromMessageContent(m.content)
    U-->>G: string content
  end
  G->>O: Create chat completion (non-stream/stream)
  O-->>G: Response / Stream chunks
  G-->>C: Response / Stream relay
  note over G,U: Unified extraction for string/array content

sequenceDiagram
  autonumber
  participant G as Gateway Tools
  participant U as Utility (extractTextFromMessageContent)
  participant T as Tokenizer

  G->>G: map messages -> ChatMessage
  G->>U: extractTextFromMessageContent(m.content)
  U-->>G: string content
  G->>T: encode ChatMessage[]
  T-->>G: token counts / throws
  alt encode error
    G->>G: fallback length-based estimate
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

auto-merge

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title "fix(tokenizer): resolve TypeError" succinctly and accurately reflects the PR's primary purpose—preventing a TypeError in token encoding by normalizing message content with the new extractTextFromMessageContent helper—and directly relates to the changes in chat.ts, calculate-prompt-tokens.ts, estimate-tokens.ts, and types.ts. It is concise, scoped to the tokenizer area, and clear enough for a teammate scanning the history.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
✨ Finishing touches

  • 📝 Generate Docstrings

🧪 Generate unit tests

  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch terragon/fix-typeerror-line-to-encode


Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions bot changed the title from "Fix TypeError in token encoding by extracting text from message content" to "fix(tokenizer): resolve TypeError" on Sep 22, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (8)
apps/gateway/src/chat/tools/estimate-tokens.ts (2)

43-49: Fallback undercounts when m.content is an array

.length on an array counts parts, not characters. Use the extractor in the fallback too.

-				calculatedPromptTokens =
-					messages.reduce((acc, m) => acc + (m.content?.length || 0), 0) / 4;
+				calculatedPromptTokens =
+					messages.reduce(
+						(acc, m) => acc + extractTextFromMessageContent(m.content).length,
+						0,
+					) / 4;
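
To make the undercount concrete (hypothetical part values):

import { extractTextFromMessageContent } from "./types";

const parts = [{ type: "text", text: "hello world" }];
parts.length;                                // 1, the number of array parts
extractTextFromMessageContent(parts).length; // 11, the number of characters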

55-55: Avoid JSON.stringify for pure strings before tokenizing

content is already a string; stringifying adds quotes/escapes and inflates counts.

-				calculatedCompletionTokens = encode(JSON.stringify(content)).length;
+				calculatedCompletionTokens = encode(content).length;
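
A quick way to see the inflation (exact token counts vary by tokenizer vocabulary, so this is illustrative only):

import { encode } from "gpt-tokenizer";

const content = 'He said "hi"\nsecond line';
encode(content).length;                 // tokens for the raw text
encode(JSON.stringify(content)).length; // higher: adds wrapping quotes plus \" and \n escapes
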
apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (1)

22-30: Same fallback issue: count text, not array length

Use the extractor during the length/4 fallback to avoid undercounting with array content.

-				messages.reduce(
-					(acc: number, m: any) => acc + (m.content?.length || 0),
-					0,
-				) / 4,
+				messages.reduce(
+					(acc: number, m: any) =>
+						acc + extractTextFromMessageContent(m.content).length,
+					0,
+				) / 4,
apps/gateway/src/chat/chat.ts (5)

732-737: Fix length/4 fallback to account for array content

Prevent undercounting when messages[*].content is an array.

-				const messageTokens = messages.reduce(
-					(acc, m) => acc + (m.content?.length || 0),
-					0,
-				);
+				const messageTokens = messages.reduce(
+					(acc, m) => acc + extractTextFromMessageContent(m.content).length,
+					0,
+				);

2547-2550: Streaming fallback: same array-content undercount

Mirror the extractor here too.

-							calculatedPromptTokens =
-								messages.reduce((acc, m) => acc + (m.content?.length || 0), 0) /
-								4;
+							calculatedPromptTokens =
+								messages.reduce(
+									(acc, m) => acc + extractTextFromMessageContent(m.content).length,
+									0,
+								) / 4;

2555-2558: Tokenizing completion: avoid JSON.stringify for strings

Use the raw fullContent to keep counts accurate.

-							calculatedCompletionTokens = encode(
-								JSON.stringify(fullContent),
-							).length;
+							calculatedCompletionTokens = encode(fullContent).length;

2655-2658: Cost logging: normalize prompt text for arrays

Improve observability and parity with token math by using the extractor.

-						prompt: messages.map((m) => m.content).join("\n"),
+						prompt: messages
+							.map((m) => extractTextFromMessageContent(m.content))
+							.join("\n"),

3041-3044: Same normalization for non‑streaming cost calculation

Keep both paths consistent.

-			prompt: messages.map((m) => m.content).join("\n"),
+			prompt: messages
+				.map((m) => extractTextFromMessageContent(m.content))
+				.join("\n"),
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 28cfd7e and a967f04.

📒 Files selected for processing (4)
  • apps/gateway/src/chat/chat.ts (3 hunks)
  • apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (2 hunks)
  • apps/gateway/src/chat/tools/estimate-tokens.ts (2 hunks)
  • apps/gateway/src/chat/tools/types.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
{apps/api,apps/gateway,apps/ui,apps/docs,packages}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Always use top-level import; never use require() or dynamic imports (e.g., import(), next/dynamic)

Files:

  • apps/gateway/src/chat/tools/estimate-tokens.ts
  • apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
  • apps/gateway/src/chat/chat.ts
  • apps/gateway/src/chat/tools/types.ts
{apps/api,apps/gateway,packages/db}/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

{apps/api,apps/gateway,packages/db}/**/*.ts: Use Drizzle ORM with the latest object syntax for database access
For reads, use db().query.<table>.findMany() or db().query.<table>.findFirst()

Files:

  • apps/gateway/src/chat/tools/estimate-tokens.ts
  • apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
  • apps/gateway/src/chat/chat.ts
  • apps/gateway/src/chat/tools/types.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx}: Never use any or as any in this TypeScript project unless absolutely necessary
Always use top-level import; never use require or dynamic imports

Files:

  • apps/gateway/src/chat/tools/estimate-tokens.ts
  • apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
  • apps/gateway/src/chat/chat.ts
  • apps/gateway/src/chat/tools/types.ts
{apps/api,apps/gateway}/src/**/*.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

For reads, use db().query.<table>.findMany() or db().query.<table>.findFirst()

Files:

  • apps/gateway/src/chat/tools/estimate-tokens.ts
  • apps/gateway/src/chat/tools/calculate-prompt-tokens.ts
  • apps/gateway/src/chat/chat.ts
  • apps/gateway/src/chat/tools/types.ts
🧬 Code graph analysis (3)
apps/gateway/src/chat/tools/estimate-tokens.ts (1)
apps/gateway/src/chat/tools/types.ts (1)
  • extractTextFromMessageContent (15-28)
apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (1)
apps/gateway/src/chat/tools/types.ts (1)
  • extractTextFromMessageContent (15-28)
apps/gateway/src/chat/chat.ts (1)
apps/gateway/src/chat/tools/types.ts (1)
  • extractTextFromMessageContent (15-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: e2e-shards (3)
  • GitHub Check: e2e-shards (2)
  • GitHub Check: e2e-shards (5)
  • GitHub Check: e2e-shards (4)
  • GitHub Check: e2e-shards (1)
  • GitHub Check: build / run
  • GitHub Check: lint / run
  • GitHub Check: generate / run
  • GitHub Check: test / run
  • GitHub Check: autofix
🔇 Additional comments (5)
apps/gateway/src/chat/tools/estimate-tokens.ts (1)

33-36: Good swap to centralized extractor for encodeChat

Using extractTextFromMessageContent here removes the TypeError surface.

apps/gateway/src/chat/tools/calculate-prompt-tokens.ts (1)

15-19: LGTM on extracting text for encodeChat

Centralization reduces edge cases and aligns with the PR goal.

apps/gateway/src/chat/chat.ts (3)

57-61: LGTM on importing the new extractor

Consistent usage across the file is the right move.


721-726: LGTM on prompt token estimation input

Passing extractTextFromMessageContent(m.content) into encodeChat fixes the original crash vector.


2532-2536: LGTM: extractor used for streaming encodeChat

Good parity with non-streaming path.

Comment on lines +15 to +28
export function extractTextFromMessageContent(content: string | any[]): string {
	if (typeof content === "string") {
		return content;
	}

	if (Array.isArray(content)) {
		return content
			.filter((part: any) => part.type === "text")
			.map((part: any) => part.text || "")
			.join(" ");
	}

	return "";
}

⚠️ Potential issue

Replace any with precise content-part types; tighten signature and add safe narrowing

Project guideline forbids any. Define a typed union for message parts and use a type guard. Also handle null/undefined and normalize spacing.

 export const DEFAULT_TOKENIZER_MODEL = "gpt-4";
 
 // Define ChatMessage type to match what gpt-tokenizer expects
 export interface ChatMessage {
   role: "user" | "system" | "assistant" | undefined;
   content: string;
   name?: string;
 }
 
+// Message content parts (mirror zod schema in chat.ts)
+export interface TextPart {
+  type: "text";
+  text: string;
+}
+export interface ImageUrlPart {
+  type: "image_url";
+  image_url: { url: string; detail?: "low" | "high" | "auto" };
+}
+export type MessageContentPart = TextPart | ImageUrlPart;
+
 /**
  * Extracts text content from a message content field, handling both string and array formats
  * This function is necessary because BaseMessage.content can be string | MessageContent[]
  * but gpt-tokenizer expects only strings
  */
-export function extractTextFromMessageContent(content: string | any[]): string {
-	if (typeof content === "string") {
-		return content;
-	}
-
-	if (Array.isArray(content)) {
-		return content
-			.filter((part: any) => part.type === "text")
-			.map((part: any) => part.text || "")
-			.join(" ");
-	}
-
-	return "";
-}
+export function extractTextFromMessageContent(
+  content: string | MessageContentPart[] | null | undefined,
+): string {
+  if (typeof content === "string") return content;
+  if (Array.isArray(content)) {
+    return content
+      .filter((part): part is TextPart => part.type === "text" && typeof (part as TextPart).text === "string")
+      .map((part) => part.text)
+      .join(" ")
+      .replace(/\s+/g, " ")
+      .trim();
+  }
+  return "";
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-export function extractTextFromMessageContent(content: string | any[]): string {
-	if (typeof content === "string") {
-		return content;
-	}
-	if (Array.isArray(content)) {
-		return content
-			.filter((part: any) => part.type === "text")
-			.map((part: any) => part.text || "")
-			.join(" ");
-	}
-	return "";
-}
+export interface TextPart {
+	type: "text";
+	text: string;
+}
+export interface ImageUrlPart {
+	type: "image_url";
+	image_url: { url: string; detail?: "low" | "high" | "auto" };
+}
+export type MessageContentPart = TextPart | ImageUrlPart;
+
+export function extractTextFromMessageContent(
+	content: string | MessageContentPart[] | null | undefined,
+): string {
+	if (typeof content === "string") return content;
+	if (Array.isArray(content)) {
+		return content
+			.filter((part): part is TextPart => part.type === "text" && typeof (part as TextPart).text === "string")
+			.map((part) => part.text)
+			.join(" ")
+			.replace(/\s+/g, " ")
+			.trim();
+	}
+	return "";
+}
🤖 Prompt for AI Agents
In apps/gateway/src/chat/tools/types.ts around lines 15 to 28, replace the use
of `any` and the loose signature by defining a MessagePart union type (e.g.,
TextPart { type: "text"; text?: string | null }, ImagePart | OtherPart as
needed), change the function signature to accept string | MessagePart[] | null |
undefined, add a type guard isTextPart(part): part is TextPart to narrow safely,
filter out null/undefined and only text parts, map to (part.text ?? "") and
normalize whitespace by trimming parts and joining with a single space, and
return an empty string for other inputs; ensure no `any` remains and the types
are exported if reused.
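
A minimal sketch of what that prompt describes, assuming the part shapes it names (this mirrors the prompt, not the merged code):

interface TextPart {
	type: "text";
	text?: string | null;
}
interface ImagePart {
	type: "image_url";
	image_url: { url: string };
}
type MessagePart = TextPart | ImagePart;

// Type guard that narrows a part to TextPart without using `any`.
function isTextPart(part: MessagePart | null | undefined): part is TextPart {
	return !!part && part.type === "text";
}

export function extractTextFromMessageContent(
	content: string | MessagePart[] | null | undefined,
): string {
	if (typeof content === "string") return content;
	if (Array.isArray(content)) {
		return content
			.filter(isTextPart)
			.map((part) => (part.text ?? "").trim())
			.filter((text) => text.length > 0)
			.join(" ");
	}
	return "";
}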
