Output HTML contains NULL characters in at least CJK languages #9985
Comments
Have you checked if it's an MDX issue? Hard to believe Docusaurus has anything to do here. I can also test later. |
I will check other CJK sites built with other software (e.g. Astro & Nextra). |
When I'm debugging this, I usually isolate an MDX compiler with the same setup as Docusaurus, and invoke it programmatically. |
None of Astro & Nextra sites seem to be affected.
Rspress, which also uses MDX (or maybe mdxjs-rs or markdown-rs instead), is not affected. However, the Ant Design documentation is affected. (They do not use Docusaurus or MDX, only remark. Also, the demo of
|
Hey To be honest I'm not super familiar with any of those concepts and won't have the bandwidth to investigate much 😅 I was just wondering, couldn't this be a Crowdin translation issue? I'm not super skilled in |
No NULL characters are found in html, md, mdx, json, or css files in your ZIP archive.
I found this issue in my (our) site where i18n is not applied, so I am convinced that Crowdin is not concerned with it. |
Thanks for investigating. Also worth giving a try to use this env variable on your site when building: |
Neither of |
https://typescriptbook.jp/ (https://github.com/yytypescript/book) This site uses Docusaurus 2.4.1, and NULL chars are not found there. |
I will check this afternoon. There's a chance that there's something environment specific. |
I found both Docusaurus and Ant Design website have And looks like https://ant.design/docs/blog/line-ellipsis-cn doesn't contain NULL now. |
In the pnpm Japanese documents, (only) the following pages contain NULL:
Blog and older versions have not been checked. Some pages contain NULL but some don't. |
I found the top page of the Docusaurus homepage in some languages has NULL:
|
I found the following pages contain NULL, too.
This shows Docusaurus 2.4.3 also has this problem. |
I hit the same issue, but cannot reproduce it with a simple '@mdx-js/mdx' demo. I found a similar issue in the terser plugin; maybe the NUL byte is caused by some core dependency? |
I run with |
While working on #10554 I also noticed that the new minifier reported errors, even on our own website. The minifier reported NULL chars for these paths:
- "/blog/2017/12/14/introducing-docusaurus"
- "/blog/releases/3.5"
- "/changelog"
- "/changelog/2.0.0-alpha.51"
- "/changelog/2.0.0-beta.10"
- "/changelog/2.3.0"
- "/tests/docs/toc/toc-test-bad"
- "/docs/migration/v3"
```
Error: Can't render static file for pathname "/docs/migration/v3"
    at generateStaticFile (/Users/sebastienlorber/Desktop/projects/docusaurus/packages/docusaurus/lib/ssg.js:118:15)
    at async /Users/sebastienlorber/Desktop/projects/docusaurus/node_modules/p-map/index.js:57:22 {
  [cause]: Error: HTML minification failed (SWC)
      at Object.minifyHtmlWithSwc [as minify] (/Users/sebastienlorber/Desktop/projects/docusaurus/packages/docusaurus-bundler/lib/minifyHtml.js:107:23)
      at async generateStaticFile (/Users/sebastienlorber/Desktop/projects/docusaurus/packages/docusaurus/lib/ssg.js:106:25)
      at async /Users/sebastienlorber/Desktop/projects/docusaurus/node_modules/p-map/index.js:57:22 {
    [cause]: Error: HTML minification diagnostic errors:
    - [error] Unexpected null character - {"primary_spans":[{"end":111132,"start":111131}],"span_labels":[]}
    - [error] Unexpected null character - {"primary_spans":[{"end":111132,"start":111131}],"span_labels":[]}
```
Source MDX:
{/* prettier-ignore */}
```mdx title="japanese.mdx"
<strong>「。」の後に文を続けると`**`が意図した動作をしません。</strong>また、<strong>[リンク](https://docusaurus.io/)</strong>や<strong>`コード`</strong>のすぐ外側に`**`、そのさらに外側に句読点以外がある場合も同様です。
```
More precisely, the NULL char occurs here. I've used this local function to get a better estimate of the position:
```ts
function reportNullChar(str: string) {
  const nullPos = str.indexOf('\0');
  if (nullPos !== -1) {
    const printAround = 100;
    const before = str.substring(Math.max(0, nullPos - printAround), nullPos);
    const after = str.substring(
      nullPos + 1,
      Math.min(str.length, nullPos + printAround),
    );
    console.warn(`HTML contains NULL char
Before: ${before}
After: ${after}
`);
  }
}
```
From my analysis, the output of MDX doesn't contain null chars, but the output of the React renderer does, so the NULL char probably appears in between. I'm pretty sure this is not limited to CJK languages, because I also have an error around this heading of "/blog/releases/3.5", and removing it fixes the error:
@tats-u I'm not super familiar with your Bash commands. How could I easily check if the |
Note that it doesn't seem to be a Webpack problem. I tried with Rspack and still get the error (or they ported the bug 🤷♂️ ). This reproduces consistently on our v3.5 blog post. I was able to "shrink it" to this smaller version:
---
title: Docusaurus 3.5
authors: [slorber]
tags: [release]
image: ./img/social-card.png
date: 2024-08-09
---
We are happy to announce **Docusaurus 3.5**.
This release contains many **new exciting blog features**.
Upgrading should be easy. Our [release process](/community/release-process) respects [Semantic Versioning](https://semver.org/). Minor versions do not include any breaking changes.
![Docusaurus blog post social card](./img/social-card.png)
{/* truncate */}
## Highlights
### Blog Social Icons
In [#10222](https://github.com/facebook/docusaurus/pull/10222), we added the possibility to associate social links to blog authors, for inline authors declared in front matter or global through the `authors.yml` file.
```yml title="blog/authors.yml"
slorber:
  name: Sébastien Lorber
  # other author properties...
  # highlight-start
  socials:
    x: sebastienlorber
    linkedin: sebastienlorber
    github: slorber
    newsletter: https://thisweekinreact.com
  # highlight-end
```
![Author socials screenshot displaying `slorber` author with 4 social platform icons](./img/author-socials.png)
Icons and handle shortcuts are provided for pre-defined platforms `x`, `linkedin`, `github` and `stackoverflow`. It's possible to provide any additional platform entry (like `newsletter` in the example above) with a full URL.
### Blog Authors Pages
The null char happens in the markup around the last heading. Deleting the heading fixes it. Even more surprising, removing the |
Finally found the bug location. The problem is in our React 18 SSG integration. Replacing
https://github.com/facebook/docusaurus/blob/main/packages/docusaurus/src/client/renderToHtml.tsx
```tsx
import type {ReactNode} from 'react';
import {renderToPipeableStream} from 'react-dom/server';
import {Writable} from 'stream';

export async function renderToHtml(app: ReactNode): Promise<string> {
  // Inspired from
  // https://react.dev/reference/react-dom/server/renderToPipeableStream#waiting-for-all-content-to-load-for-crawlers-and-static-generation
  // https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/cache-dir/static-entry.js
  const writableStream = new WritableAsPromise();
  const {pipe} = renderToPipeableStream(app, {
    onError(error) {
      writableStream.destroy(error as Error);
    },
    onAllReady() {
      pipe(writableStream);
    },
  });
  return writableStream.getPromise();
}

// WritableAsPromise inspired by https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/cache-dir/server-utils/writable-as-promise.js
/* eslint-disable no-underscore-dangle */
class WritableAsPromise extends Writable {
  private _output: string;
  private _deferred: {
    promise: Promise<string> | null;
    resolve: (value: string) => void;
    reject: (reason: Error) => void;
  };

  constructor() {
    super();
    this._output = ``;
    this._deferred = {
      promise: null,
      resolve: () => null,
      reject: () => null,
    };
    this._deferred.promise = new Promise((resolve, reject) => {
      this._deferred.resolve = resolve;
      this._deferred.reject = reject;
    });
  }

  override _write(
    chunk: {toString: () => string},
    _enc: unknown,
    next: () => void,
  ) {
    this._output += chunk.toString();
    next();
  }

  override _destroy(error: Error | null, next: (error?: Error | null) => void) {
    if (error instanceof Error) {
      this._deferred.reject(error);
    } else {
      next();
    }
  }

  override end() {
    this._deferred.resolve(this._output);
    return this.destroy();
  }

  getPromise(): Promise<string> {
    return this._deferred.promise!;
  }
}
```
Edit: it could be a React bug: https://x.com/joshcstory/status/1842254523194314900 |
At Jochen Schweizer, I've used e.g. as an example:
```ts
class WritableStream extends Writable {
  html = '';
  decoder = new TextDecoder();
  _write(chunk, enc, next) {
    this.html += this.decoder.decode(chunk, { stream: true });
    next();
  }
  destroy() {
    this.decoder = null;
    this.html = null;
  }
}
```
|
Thanks. Yes, our stream-to-promise code is buggy. I got at least 3 better solutions here, one of them being TextEncoder: https://x.com/phry/status/1842301184763425075?t=CqSlq7pLVEjyu2fpHqs_sQ&s=19 |
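To illustrate why the `{stream: true}` option in the snippet above matters for CJK text, here is a minimal sketch. The helper names `decodePerChunk` and `decodeStreaming` are hypothetical and exist only for this illustration; the chunk boundary is chosen by hand and is not what React actually emits. Note that in this sketch a badly split character shows up as U+FFFD rather than as a NULL char, so it does not by itself explain the NULL chars discussed in this issue.

```ts
// Hypothetical helpers for illustration only; not Docusaurus or React APIs.

// Decodes every chunk independently. A multi-byte UTF-8 character that is
// split across two chunks cannot be decoded and is replaced by U+FFFD.
export function decodePerChunk(chunks: Uint8Array[]): string {
  return chunks.map((chunk) => new TextDecoder().decode(chunk)).join('');
}

// Reuses one decoder in streaming mode, so incomplete trailing bytes are
// buffered and completed by the next chunk.
export function decodeStreaming(chunks: Uint8Array[]): string {
  const decoder = new TextDecoder();
  let text = '';
  for (const chunk of chunks) {
    text += decoder.decode(chunk, {stream: true});
  }
  // Final call without arguments flushes anything still buffered.
  return text + decoder.decode();
}

// Each of these characters takes 3 bytes in UTF-8, so slicing at byte 4
// cuts the second character in the middle.
const bytes = new TextEncoder().encode('日本語');
const chunks = [bytes.slice(0, 4), bytes.slice(4)];

console.log(decodePerChunk(chunks) === '日本語');
console.log(decodeStreaming(chunks) === '日本語');
```

Only the streaming variant reproduces the original string here, which is why a single long-lived decoder is the safer choice when concatenating binary chunks into HTML.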
TLDR: I reported a React bug: facebook/react#31134
I thought our
For example, this code:
```tsx
import type {ReactNode} from 'react';
import {renderToPipeableStream, renderToString} from 'react-dom/server';
import {PassThrough} from 'node:stream';
import {text} from 'node:stream/consumers';

export async function renderToHtml(app: ReactNode): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const passThrough = new PassThrough();
    const {pipe} = renderToPipeableStream(app, {
      onError(error) {
        reject(error);
      },
      onAllReady() {
        pipe(passThrough);
        text(passThrough).then(resolve, reject);
      },
    });
  });
}
```
When adding this little test code:
```tsx
if (html.includes('\0')) {
  const goodHtml = renderToString(app);
  throw new Error(`renderToPipeableStream HTML contains null chars
renderToPipeableStream HTML length = ${html.length}
renderToString HTML length = ${goodHtml.length}
renderToString HTML contains null chars??? = ${goodHtml.includes('\0')}
`);
}
```
This will error with:
Exact same for:
```tsx
import type {ReactNode} from 'react';
import {renderToPipeableStream, renderToString} from 'react-dom/server';
import {PassThrough, Readable} from 'node:stream';

export async function renderToHtml(app: ReactNode): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const {pipe} = renderToPipeableStream(app, {
      onError(error) {
        reject(error);
      },
      onAllReady() {
        const passThrough = new PassThrough();
        pipe(passThrough);
        const webStream = Readable.toWeb(passThrough);
        // @ts-expect-error: temp
        new Response(webStream).text().then(resolve, reject);
      },
    });
  });
}
```
Exact same result for:
```tsx
import type {ReactNode} from 'react';
import {renderToPipeableStream} from 'react-dom/server';
import {Writable} from 'node:stream';

class WritableStream extends Writable {
  html = '';
  decoder = new TextDecoder();
  // @ts-expect-error: temp
  _write(chunk, enc, next) {
    this.html += this.decoder.decode(chunk, {stream: true});
    next();
  }
}

export async function renderToHtml(app: ReactNode): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const {pipe} = renderToPipeableStream(app, {
      onError(error) {
        reject(error);
      },
      onAllReady() {
        const writeableStream: WritableStream = new WritableStream();
        pipe(writeableStream);
        resolve(writeableStream.html);
      },
    });
  });
}
```
Not sure if I'm supposed to use a specific TextEncoder encoding, but I tried various ones and didn't get any improvement.
Note: the paths that generate NULL chars on our Docusaurus website are:
[cause]: Error: Docusaurus static site generation failed for 8 paths:
- "/blog/2017/12/14/introducing-docusaurus"
- "/blog/releases/3.5"
- "/changelog"
- "/changelog/2.0.0-alpha.51"
- "/changelog/2.0.0-beta.10"
- "/changelog/2.3.0"
- "/tests/docs/toc/toc-test-bad"
- "/docs/migration/v3"
Some paths generate more than one NULL char, for example
The extra chars are always NULL chars, and this always prints true:
`Equal without null chars = ${html.replace(/\0/g, '') === goodHtml}`
Note: I doubt React v18 will fix it, so maybe for Docusaurus v3.x we could just apply this workaround temporarily: |
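The workaround mentioned above is not reproduced in this thread. As a rough sketch only, and assuming (as observed above) that the streamed HTML differs from the `renderToString` output solely by extra NULL chars, a post-processing step could look like the following. `stripNullChars`, `countNullChars` and `equalIgnoringNullChars` are hypothetical helper names, not Docusaurus or React APIs.

```ts
// Hypothetical helpers for illustration only; not Docusaurus or React APIs.

// Removes every NULL (U+0000) char from a rendered HTML string.
export function stripNullChars(html: string): string {
  return html.replace(/\0/g, '');
}

// Counts NULL chars, e.g. to log how many were removed for a given path.
export function countNullChars(html: string): number {
  return html.length - stripNullChars(html).length;
}

// Mirrors the "Equal without null chars" check quoted above: true when the
// streamed HTML matches the reference once NULL chars are ignored.
export function equalIgnoringNullChars(
  streamHtml: string,
  referenceHtml: string,
): boolean {
  return stripNullChars(streamHtml) === referenceHtml;
}
```

Stripping is only safe under the stated assumption; if the two outputs ever differed by anything other than NULL chars, `equalIgnoringNullChars` would return false and the mismatch would need a closer look.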
Looks like using |
```bash
grep -lF $'\x00' *.html
find -name '*.html' -type f -exec grep -lF $'\x00' {} +
```
Anyway, glad that we were able to find that this is presumably due to a bug in React itself. |
Sorry, I should have used the -P option instead. (On macOS, use -E instead.)
```bash
grep -lPa '\x00' *.html
find \( -name '*.html' -o -name '*.js' \) -type f -exec grep -lPa '\x00' {} +
```
|
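For readers who are less comfortable with the shell commands above, the same check can be sketched as a small Node.js script. This is only an illustration under assumptions: the function names `findNullOffsets`, `listFiles` and `scanForNullBytes` are hypothetical, and the default extensions simply mirror the `*.html` and `*.js` patterns used in the `find` command.

```ts
import {readdirSync, readFileSync} from 'node:fs';
import {join} from 'node:path';

// Returns the byte offset of every NUL (0x00) byte in a buffer.
export function findNullOffsets(bytes: Uint8Array): number[] {
  const offsets: number[] = [];
  for (let i = 0; i < bytes.length; i += 1) {
    if (bytes[i] === 0) {
      offsets.push(i);
    }
  }
  return offsets;
}

// Recursively lists files under `dir` whose name ends with one of `extensions`.
export function listFiles(dir: string, extensions: string[]): string[] {
  const files: string[] = [];
  for (const entry of readdirSync(dir, {withFileTypes: true})) {
    const fullPath = join(dir, entry.name);
    if (entry.isDirectory()) {
      files.push(...listFiles(fullPath, extensions));
    } else if (extensions.some((ext) => entry.name.endsWith(ext))) {
      files.push(fullPath);
    }
  }
  return files;
}

// Maps each affected file to the byte offsets of its NUL bytes.
// Files without NUL bytes are omitted from the report.
export function scanForNullBytes(
  dir: string,
  extensions: string[] = ['.html', '.js'],
): Map<string, number[]> {
  const report = new Map<string, number[]>();
  for (const file of listFiles(dir, extensions)) {
    const offsets = findNullOffsets(readFileSync(file));
    if (offsets.length > 0) {
      report.set(file, offsets);
    }
  }
  return report;
}
```

Calling `scanForNullBytes('build')` on a built site (the directory name is an assumption) would return an empty map when no NUL bytes are present, which corresponds to the commands above printing nothing.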
@tats-u I believe our new HTML minifier (available in canary, upcoming v3.6) fixes the null chars: #10554
With this new minifier, this emits nulls:
```bash
SKIP_HTML_MINIFICATION=true yarn build:website:fast
rg '\x00' -a -r '[[NULL]]' --color=always -t html website/build | perl -C -pe 'use utf8; s/^.+?(.{50})(?=\[\[NULL)/...\1/'
```
This doesn't emit nulls, but the minifier reports a warning instead:
```bash
yarn build:website:fast
rg '\x00' -a -r '[[NULL]]' --color=always -t html website/build | perl -C -pe 'use utf8; s/^.+?(.{50})(?=\[\[NULL)/...\1/'
```
Until we figure out the React SSR/SSG bug, I'll silence that minifier warning. Can you please check on our website or using a canary locally and tell us if you still see any NULL char? |
@slorber I confirmed it in the official website and the Japanese staging site.
|
Great, so at least we have a decent workaround to the possible React bug, available in canary and soon v3.6. |
Have you read the Contributing Guidelines on issues?
Prerequisites
- Tried the `npm run clear` or `yarn clear` command.
- Tried `rm -rf node_modules yarn.lock package-lock.json` and re-installing packages.
Description
Docusaurus sometimes contaminates output HTML with NULL characters.
NULL characters confuse the HTML parsers used in some document scrapers, such as https://github.com/meilisearch/docs-scraper (which uses lxml, written in Python).
They also prevent Windows' copy-and-paste feature from copying the complete source code.
Reproducible demo
No response
Steps to reproduce
Note: `rg` is ripgrep.
For your own documents
Write your documents in CJK or possibly other non-latin languages and then do:
Note
Built JS files do not seem to be affected. (no NULs are found there)
Expected behavior
No outputs (NULL characters are not found)
Actual behavior
🇨🇳
🇯🇵
🇰🇷
Note
Your environment
First found private document site written in Japanese:
The above commands are run in Ubuntu 22.04 on WSL on Windows 11.
Self-service