fix: correct Chinese and special characters display in HTML renderer #305

Open: wangyinyuan wants to merge 1 commit into main

Issue Description

The HTML renderer does not display non-ASCII characters (such as Chinese, Japanese, or Korean) correctly, for two reasons:

  1. The original code only applies character encoding handling when encoding is explicitly specified in the data URL
  2. When there's no explicit charset in the data URL, it skips the decoding process entirely, leading to garbled characters

Root Cause

The issue occurs because Base64-encoded HTML content needs character decoding whether or not a charset is explicitly specified. atob() does not return text: it returns a binary string in which each character holds one byte of the decoded data (a code unit from 0 to 255, as if the bytes were Latin-1). Those bytes must then be decoded with the correct charset to recover the original characters. For more information about Base64, see the MDN documentation.
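To illustrate the root cause, here is a minimal sketch (not taken from the renderer's code) that round-trips a short string through Base64. The sample string and variable names are assumptions chosen for the example; only standard globals (TextEncoder, TextDecoder, btoa, atob) are used.

```javascript
// Build a Base64 payload from the UTF-8 bytes of a non-ASCII string.
const original = "中文 test";
const utf8Bytes = new TextEncoder().encode(original);
const base64 = btoa(String.fromCharCode(...utf8Bytes));

// atob() yields a binary string: one character per byte, code units 0-255.
// The two Chinese characters occupy three bytes each, so the lengths differ.
const binary = atob(base64);
console.log(original.length, binary.length);

// Using the binary string directly as text would show garbled characters.
console.log(binary === original);

// Copy each code unit into a byte array, then decode those bytes as UTF-8.
const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
const decoded = new TextDecoder("utf-8").decode(bytes);
console.log(decoded === original);
```

The binary string is longer than the original and compares unequal to it, while the TextDecoder result matches the original exactly.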

Changes Made

// Before
if (encoding) {
  const buffer = new Uint8Array(body.length);
  for (let i = 0; i < body.length; i++) buffer[i] = body.charCodeAt(i);
  body = new TextDecoder(encoding).decode(buffer);
}
// After
// Always handle encoding with utf-8 as fallback
encoding = charset || "utf-8";
const buffer = Uint8Array.from(body, (c) => c.charCodeAt(0));
body = new TextDecoder(encoding).decode(buffer);

Key Improvements

  1. Always perform character encoding conversion, not just when charset is specified
  2. Use "utf-8" as fallback encoding when charset is not specified
  3. Use Uint8Array.from() for more concise buffer creation
  4. Ensure consistent handling of all non-ASCII characters
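The points above can be sketched as a standalone helper. This is an illustration under stated assumptions, not the renderer's actual function: the names decodeBase64Body and toBase64Utf8 are hypothetical, and the sketch assumes the Base64 body and the optional charset have already been extracted from the data URL.

```javascript
// Hypothetical helper, not the project's actual function.
function decodeBase64Body(base64, charset) {
  // Fall back to UTF-8 when the data URL carries no charset.
  const encoding = charset || "utf-8";
  // atob() returns one character per byte; copy the code units into a byte array.
  const bytes = Uint8Array.from(atob(base64), (c) => c.charCodeAt(0));
  // TextDecoder throws a RangeError if the label is not a known encoding.
  return new TextDecoder(encoding).decode(bytes);
}

// Hypothetical helper for building sample input: Base64 of a string's UTF-8 bytes.
function toBase64Utf8(text) {
  return btoa(String.fromCharCode(...new TextEncoder().encode(text)));
}

const html = "<p>你好, world</p>";
const sample = toBase64Utf8(html);

console.log(decodeBase64Body(sample) === html);          // no charset given
console.log(decodeBase64Body(sample, "utf-8") === html); // explicit charset
```

One design consequence worth noting: with UTF-8 as the fallback, content that was actually encoded in another charset but declared none will still be decoded as UTF-8, so an explicit charset in the data URL remains the reliable option for non-UTF-8 content.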

Testing

Tested with HTML files containing:

  • Chinese characters
  • Mixed ASCII and non-ASCII content

All characters now display correctly regardless of whether charset is explicitly specified in the data URL.
