
fix(skills): force explicit UTF-8 on locale-default Python file reads#524

Open
H179922 wants to merge 1 commit into
Egonex-AI:mainfrom
multimail-dev:fix/domain-context-utf8-encoding

@H179922 H179922 commented Jun 29, 2026

Problem

Two Python files read and write files without an explicit encoding= argument. pathlib's read_text/write_text fall back to locale.getpreferredencoding(False) when none is given. On Windows, unless Python UTF-8 mode (PEP 540) is enabled, that is the ANSI codepage (cp1252 / cp936 / cp932), not UTF-8.
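
To check which default a given interpreter would fall back to, a minimal sketch may help. The helper name `describe_default_text_encoding` is hypothetical and not part of this PR; it only reports what the standard library exposes.

```python
import locale
import sys


def describe_default_text_encoding() -> dict:
    """Report the encoding used when encoding= is omitted."""
    return {
        # Locale-dependent default; typically an ANSI codepage on Windows.
        "preferred": locale.getpreferredencoding(False),
        # 1 when Python UTF-8 mode is enabled (-X utf8 or PYTHONUTF8=1), else 0.
        "utf8_mode": sys.flags.utf8_mode,
    }


info = describe_default_text_encoding()
print(info)
```

If `utf8_mode` is 0 and `preferred` is not UTF-8, reads without encoding= on that machine are exposed to the problem described here.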

skills/understand-domain/extract-domain-context.py (lines 126, 211, 270, 315, 418) reads source files, .gitignore, and metadata, then writes domain-context.json. On Windows the reads decode UTF-8 source against the wrong codepage and silently corrupt the context that the domain-analyzer agent consumes: CJK comments, accented identifiers, em-dashes, and the translated README text pulled in at line 332.

errors="replace" does not catch this: legacy codepages decode almost every byte to some (wrong) character, so few or no replacement characters appear, and errors="replace" itself suppresses UnicodeDecodeError. The result is silent mojibake.

Reproduction

A UTF-8 source file with a CJK comment, read the way Windows would:

source (utf-8) : '# 用户订单服务 — order service'
cp1252 (EN Win): '# ç”¨æˆ·è®¢å•æœ�务 —'
cp936  (CN Win): '# 鐢ㄦ埛璁㈠崟鏈嶅姟 鈥 order s'

Both alternate reads used errors="replace": neither raised, and both are silently wrong.
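
The reproduction can be approximated on any platform by decoding the UTF-8 bytes with the Windows codepages directly. This is a sketch, not the script used for the PR; the helper `decode_as` is a hypothetical name.

```python
SOURCE = "# 用户订单服务 — order service"
raw = SOURCE.encode("utf-8")


def decode_as(codec: str) -> str:
    """Decode the UTF-8 bytes as if the locale default were `codec`."""
    return raw.decode(codec, errors="replace")


as_utf8 = decode_as("utf-8")      # the correct read
as_cp1252 = decode_as("cp1252")   # English Windows default
as_cp936 = decode_as("cp936")     # Simplified Chinese Windows default

for name, text in [("utf-8", as_utf8), ("cp1252", as_cp1252), ("cp936", as_cp936)]:
    print(f"{name:7}: {text!r}")
```

None of the three calls raises; only the UTF-8 decode returns the original string.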

tests/skill/understand/test_merge_batch_graphs.py (lines 986, 1105) reads assembled-graph.json back for assertions. merge-batch-graphs.py writes it with ensure_ascii=False, so it can contain raw non-ASCII that the unencoded read would mojibake on Windows, failing the test there.
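
To illustrate why ensure_ascii=False matters here, the sketch below writes a small JSON file the same way and reads it back under two encodings. The payload and the cp1252 decode are assumptions chosen for illustration; the real test reads whatever merge-batch-graphs.py produced.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical graph payload containing raw non-ASCII text.
graph = {"nodes": [{"id": "n1", "label": "用户订单服务"}]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "assembled-graph.json"
    # ensure_ascii=False keeps non-ASCII characters as raw UTF-8 bytes on disk.
    path.write_text(json.dumps(graph, ensure_ascii=False), encoding="utf-8")
    raw = path.read_bytes()

    # Explicit UTF-8 read: round-trips exactly.
    loaded = json.loads(path.read_text(encoding="utf-8"))
    # Simulated locale-default read on an English Windows machine.
    wrong = json.loads(raw.decode("cp1252", errors="replace"))

print(loaded == graph, wrong == graph)
```

The simulated read still parses as valid JSON, so an assertion on the label is the first place the corruption becomes visible.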

Fix

Pass encoding="utf-8" to the affected reads and the write. This matches the convention already used in every other Python script in the repo (merge-batch-graphs.py, merge-subdomain-graphs.py, parse-knowledge-base.py, merge-knowledge-graph.py). No behavior change on Linux/macOS, where the locale default is already UTF-8.
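
The shape of the change can be sketched as follows. The helper names `read_source` and `write_context` are hypothetical stand-ins, not the functions in extract-domain-context.py; the point is only the explicit encoding= argument on each call.

```python
import tempfile
from pathlib import Path


def read_source(path: Path) -> str:
    """Read text as UTF-8 regardless of the platform locale."""
    # errors="replace" is kept so undecodable bytes do not abort the scan.
    return path.read_text(encoding="utf-8", errors="replace")


def write_context(path: Path, text: str) -> None:
    """Write text as UTF-8 regardless of the platform locale."""
    path.write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "domain-context.json"
    write_context(target, '{"summary": "用户订单服务"}')
    round_tripped = read_source(target)

print(round_tripped)
```

With both sides pinned to UTF-8, the round trip no longer depends on the machine's codepage.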

Scope

7 lines across 2 files. Relates to the Windows-compat reports #262 and #340.

The domain-context scanner and a merge test read files via
Path.read_text(errors="replace") with no explicit encoding. pathlib's
read_text/write_text fall back to locale.getpreferredencoding(False) when
encoding is omitted; on Windows, absent Python UTF-8 mode (PEP 540), that is
the ANSI codepage (cp1252 / cp936 / cp932), not UTF-8.

extract-domain-context.py (lines 126, 211, 270, 315, 418) reads source files,
.gitignore, and metadata, then writes domain-context.json. On Windows the
reads decode UTF-8 source against the wrong codepage, silently mojibaking CJK
comments, accented identifiers, em-dashes, and translated README text into the
context fed to the domain-analyzer agent. errors="replace" does not catch
this: legacy codepages decode nearly every byte to a *wrong* character rather
than raising, so corruption is silent.

test_merge_batch_graphs.py (lines 986, 1105) reads assembled-graph.json back
for assertions; merge-batch-graphs.py writes it with ensure_ascii=False, so it
can contain raw non-ASCII that the unencoded read would mojibake on Windows.

Both now pass encoding="utf-8", matching every other Python script in the repo
(merge-batch-graphs.py, merge-subdomain-graphs.py, parse-knowledge-base.py,
merge-knowledge-graph.py). No behavior change on Linux/macOS, where the locale
default is already UTF-8.

Relates to the Windows-compat reports Egonex-AI#262, Egonex-AI#340.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>