fix(skills): force explicit UTF-8 on locale-default Python file reads by H179922 · Pull Request #524 · Egonex-AI/Understand-Anything

H179922 · 2026-06-29T01:58:32Z

Problem

Two Python files read and write text with no explicit encoding= argument. pathlib's read_text/write_text fall back to locale.getpreferredencoding(False) when none is given. On Windows, unless Python UTF-8 mode (PEP 540) is enabled, that is the ANSI codepage (cp1252 / cp936 / cp932), not UTF-8.
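To check which encoding an unqualified read would use on a given machine, a minimal sketch like the following may help. The variable names are illustrative only, and the printed value depends on the platform and on whether UTF-8 mode is enabled.

```python
import locale
import sys

# Encoding that open() / Path.read_text() fall back to when encoding= is omitted.
default_encoding = locale.getpreferredencoding(False)

# True when Python UTF-8 mode is active (python -X utf8 or PYTHONUTF8=1).
utf8_mode = bool(sys.flags.utf8_mode)

print(f"locale default: {default_encoding}, UTF-8 mode: {utf8_mode}")
```

On a typical Linux or macOS setup this should report a UTF-8 default; on Windows without UTF-8 mode it is expected to report the ANSI codepage.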

skills/understand-domain/extract-domain-context.py (lines 126, 211, 270, 315, 418) reads source files, .gitignore, and metadata, then writes domain-context.json. On Windows the reads decode UTF-8 source against the wrong codepage and silently corrupt the context that the domain-analyzer agent consumes — CJK comments, accented identifiers, em-dashes, the translated README text pulled in at line 332.

errors="replace" does not catch this: legacy codepages decode almost every byte to some (wrong) character instead of raising, so there's no UnicodeDecodeError and at most an occasional replacement char for the few undefined bytes, just silent mojibake.

Reproduction

A UTF-8 source file with a CJK comment, read the way Windows would:

source (utf-8) : '# 用户订单服务 — order service'
cp1252 (EN Win): '# ç”¨æˆ·è®¢å•æœ�åŠ¡ â€”'
cp936  (CN Win): '# 鐢ㄦ埛璁㈠崟鏈嶅姟 鈥 order s'

Both alternate reads used errors="replace" — neither raised, both are silently wrong.
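The reproduction can be approximated on any platform by decoding the UTF-8 bytes by hand, which is a sketch of what a locale-default read does on Windows rather than the exact script used for the output above. The variable names are illustrative.

```python
# UTF-8 bytes of a source line, as they sit on disk.
source = "# 用户订单服务 order service"
raw = source.encode("utf-8")

# Decoding with the intended encoding round-trips cleanly.
as_utf8 = raw.decode("utf-8")

# Decoding with a legacy codepage does not raise under errors="replace";
# it yields different, wrong characters instead.
as_cp1252 = raw.decode("cp1252", errors="replace")
as_cp936 = raw.decode("cp936", errors="replace")

print(repr(as_utf8))
print(repr(as_cp1252))
print(repr(as_cp936))
```

Neither legacy decode raises, and neither result equals the original string, which is the failure mode described above.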

tests/skill/understand/test_merge_batch_graphs.py (lines 986, 1105) reads assembled-graph.json back for assertions. merge-batch-graphs.py writes it with ensure_ascii=False, so it can contain raw non-ASCII that the unencoded read would mojibake on Windows, failing the test there.
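A small sketch of the round trip the test depends on, assuming a toy graph payload rather than the real assembled-graph.json contents: json.dumps with ensure_ascii=False leaves non-ASCII text as raw UTF-8 bytes on disk, so the read side has to name the same encoding.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical payload; the real graph schema is not shown in this PR.
graph = {"nodes": [{"id": "订单服务", "summary": "café order service"}]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "assembled-graph.json"

    # ensure_ascii=False keeps non-ASCII characters unescaped in the output.
    path.write_text(json.dumps(graph, ensure_ascii=False), encoding="utf-8")

    # The file therefore contains raw UTF-8 bytes, not \uXXXX escapes.
    raw = path.read_bytes()

    # Reading back with the same explicit encoding is locale-independent.
    loaded = json.loads(path.read_text(encoding="utf-8"))
```

With the encoding stated on both sides, the loaded data matches what was written regardless of the platform's locale default.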

Fix

Pass encoding="utf-8" to the affected reads/write. This matches the convention already used in every other Python script in the repo (merge-batch-graphs.py, merge-subdomain-graphs.py, parse-knowledge-base.py, merge-knowledge-graph.py). No behavior change on Linux/macOS, where the locale default is already UTF-8.
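The shape of the change can be sketched as below. The helper name read_source is hypothetical and is not taken from extract-domain-context.py; it only illustrates that encoding="utf-8" and errors="replace" work together, with the latter still covering bytes that are genuinely invalid UTF-8.

```python
import tempfile
from pathlib import Path


def read_source(path: Path) -> str:
    """Read text as UTF-8 regardless of locale; invalid bytes become U+FFFD."""
    return path.read_text(encoding="utf-8", errors="replace")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.py"

    # Valid UTF-8 ("café") plus one stray byte (0xFF) that is never valid UTF-8.
    sample.write_bytes(b"ok \xff caf\xc3\xa9")

    text = read_source(sample)
```

Here the valid sequence decodes to the intended character and only the stray byte is replaced, so errors="replace" keeps its intended role once the encoding is explicit.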

Scope

7 lines across 2 files. Relates to the Windows-compat reports #262 and #340.

The domain-context scanner and a merge test read files via Path.read_text(errors="replace") with no explicit encoding. pathlib's read_text/write_text fall back to locale.getpreferredencoding(False) when encoding is omitted; on Windows, absent Python UTF-8 mode (PEP 540), that is the ANSI codepage (cp1252 / cp936 / cp932), not UTF-8.

extract-domain-context.py (lines 126, 211, 270, 315, 418) reads source files, .gitignore, and metadata, then writes domain-context.json. On Windows the reads decode UTF-8 source against the wrong codepage, silently mojibaking CJK comments, accented identifiers, em-dashes, and translated README text into the context fed to the domain-analyzer agent. errors="replace" does not catch this: legacy codepages decode nearly every byte to a *wrong* character rather than raising, so corruption is silent.

test_merge_batch_graphs.py (lines 986, 1105) reads assembled-graph.json back for assertions; merge-batch-graphs.py writes it with ensure_ascii=False, so it can contain raw non-ASCII that the unencoded read would mojibake on Windows.

Both now pass encoding="utf-8", matching every other Python script in the repo (merge-batch-graphs.py, merge-subdomain-graphs.py, parse-knowledge-base.py, merge-knowledge-graph.py). No behavior change on Linux/macOS, where the locale default is already UTF-8.

Relates to the Windows-compat reports Egonex-AI#262, Egonex-AI#340.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

H179922 wants to merge 1 commit into Egonex-AI:main from multimail-dev:fix/domain-context-utf8-encoding
