Symptom
Every totem-lint SARIF upload to GitHub Code Scanning has been silently rejected since 2026-04-06. GitHub's processing fails with:
Invalid SARIF document: unexpected end of hex escape at line 1 column 126526
Nothing surfaces: the Upload SARIF to GitHub Security step reports success (the rejection happens in GitHub's asynchronous processing, and the step is continue-on-error: true per #1226), the Lint job is green, and the error exists only inside the step's log group. It was exposed only because CodeRabbit scrapes check-run annotations into its review summaries — the same quote appeared on #2294 and #2295 at the identical byte column, which is what proved it diff-independent.
Last successful totem-lint analysis (gh api repos/mmnto-ai/totem/code-scanning/analyses): 2026-04-06T03:15:32Z. The audit-trail category the lint.yml comment says is "kept ON deliberately" (the #474 false-positive measurement trail) has been dark for ~3 months.
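The staleness figure can be read straight off the analyses listing. A minimal sketch, assuming the JSON array returned by that gh api call has been parsed into objects carrying category and created_at fields, and that the category value is the literal string totem-lint. The helper names (latestAnalysis, daysSince) and the sample array are hypothetical and illustrative, not real API output:

```javascript
// Returns the newest analysis whose category matches, or null if none exists.
function latestAnalysis(analyses, category) {
  const matching = analyses.filter((a) => a.category === category);
  if (matching.length === 0) return null;
  return matching.reduce((newest, a) =>
    Date.parse(a.created_at) > Date.parse(newest.created_at) ? a : newest
  );
}

// Whole days elapsed between an ISO timestamp and `now`.
function daysSince(isoTimestamp, now = new Date()) {
  return Math.floor((now.getTime() - Date.parse(isoTimestamp)) / 86400000);
}

// Illustrative data only: the shape is assumed, the values are made up
// except for the last-good timestamp quoted in this issue.
const sampleAnalyses = [
  { category: "some-other-tool", created_at: "2026-06-30T10:00:00Z" },
  { category: "totem-lint", created_at: "2026-04-06T03:15:32Z" },
  { category: "totem-lint", created_at: "2026-04-05T22:41:07Z" },
];

const lastGood = latestAnalysis(sampleAnalyses, "totem-lint");
console.log(lastGood ? lastGood.created_at : "no analysis found");
```

The same check, run after an upload with a threshold on daysSince, is one way to turn a dark category into a failing step.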
Root cause
A frozen compiled rule — totem/5afaf8d03f059a41 (emoji-in-markdown detector, category architecture, fileGlobs **/*.md, **/*.mdx) — carries a regex pattern whose character class encodes astral ranges as UTF-16 surrogate-pair ranges, leaving unpaired surrogates in the pattern string (e.g. …\uD83C-…-\uDFFF… split around -).
The failure chain:
- totem lint --format sarif embeds the raw pattern string in the rule's properties.pattern. As a JS string this is valid (lone surrogates are legal in UTF-16 strings), and our own emitted file is valid JSON (verified locally with JSON.parse).
- github/codeql-action/upload-sarif re-serializes the document (fingerprint injection) with well-formed JSON.stringify, which emits lone surrogates as bare \ud83c-style escapes on a single line.
- GitHub's SARIF ingestion parser reads the high-surrogate escape, expects the low-surrogate continuation, and hits the "-" range character instead → unexpected end of hex escape. Document rejected; no analysis recorded.
- continue-on-error: true + processing-failure-not-step-failure ⇒ fully silent.
The column is constant across PRs because it falls inside the serialized rules array (387 frozen rules — byte-identical prefix across runs); results (diff-dependent) serialize after.
Reproduced locally: generate the SARIF (node packages/cli/dist/index.js lint --format sarif --out …), JSON.stringify(JSON.parse(file)), inspect around column 126526 → the surrogate-range pattern sits exactly there.
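The re-serialization step can be mimicked in a few lines. The sketch below uses a hypothetical stand-in pattern (not the frozen rule's actual text) and a hypothetical helper, findLoneSurrogateEscapes, which reports the 1-based column of every \uXXXX surrogate escape that lacks its partner in a JSON text:

```javascript
// Scans JSON text for \uXXXX escapes that encode an unpaired surrogate.
// Returns the 1-based column (string offset + 1) of each offending escape.
function findLoneSurrogateEscapes(json) {
  // Consume every escape sequence in order so that "\\\\u..." (an escaped
  // backslash followed by plain text) is not mistaken for a \u escape.
  const escape = /\\(?:u([0-9a-fA-F]{4})|[^u])/g;
  const hits = [];
  let pendingHigh = null; // high surrogate still waiting for its low half
  for (const m of json.matchAll(escape)) {
    const code = m[1] === undefined ? -1 : parseInt(m[1], 16);
    const isHigh = code >= 0xd800 && code <= 0xdbff;
    const isLow = code >= 0xdc00 && code <= 0xdfff;
    const adjacent = pendingHigh !== null && pendingHigh.end === m.index;
    // A pending high surrogate is only satisfied by an immediately
    // following low-surrogate escape.
    if (pendingHigh !== null && !(isLow && adjacent)) hits.push(pendingHigh.column);
    if (isLow && !adjacent) hits.push(m.index + 1);
    pendingHigh = isHigh ? { column: m.index + 1, end: m.index + m[0].length } : null;
  }
  if (pendingHigh !== null) hits.push(pendingHigh.column);
  return hits;
}

// Stand-in for a rule whose character class uses surrogates as range endpoints.
const serialized = JSON.stringify({ pattern: "[\uD83C-\uD83E]" });
console.log(serialized);                           // {"pattern":"[\ud83c-\ud83e]"}
console.log(findLoneSurrogateEscapes(serialized)); // [ 14, 21 ]
```

Run against the re-serialized SARIF, a scan like this should point at the same column GitHub reports, assuming the file is a single line as in the upload path.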
Timing corroboration: the rule's mint window matches the 2026-04-06 last-good analysis.
Why the rule itself is NOT broken
JS regex without the u flag matches by UTF-16 code unit — surrogate-range classes function as authored. Enforcement is unaffected. Only the SARIF export path chokes, so the fix is serialization-side and does not require recompiling frozen rules (compile freeze unaffected; kin to #2055/#2289, not blocked by them).
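A small illustration of the code-unit behaviour, using a hypothetical pattern written in the same surrogate-range style (the frozen rule's real pattern is not reproduced here):

```javascript
// Without the u flag, a character class is a set of UTF-16 code units, so a
// high-surrogate range followed by a low-surrogate range matches the two
// halves of an astral character one unit at a time.
const surrogateStyle = new RegExp("[\uD83C-\uD83E][\uDC00-\uDFFF]");

const party = "\uD83C\uDF89"; // U+1F389 PARTY POPPER as its two code units

console.log(surrogateStyle.test("release notes " + party)); // true
console.log(surrogateStyle.test("release notes"));          // false
```

The pattern string still contains unpaired surrogates, which is harmless for matching and only becomes a problem once that string is exported as metadata.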
Fix direction
- SARIF serializer (lint --format sarif): sanitize rule-metadata strings to well-formed Unicode before embedding, either with String.prototype.toWellFormed() (Node ≥ 20) on pattern/description fields or by replacing unpaired surrogates with � (U+FFFD). The pattern property is documentation in the SARIF context, not executable, so lossy well-forming is safe.
- Make future rejections visible: the upload step should poll processing status (codeql-action supports wait-for-processing: true; verify whether it is on and whether its failure is eaten by continue-on-error), or a follow-up step should assert a new analysis landed for the totem-lint category. Three months of silent darkness is the actual defect; the surrogate is just today's instance.
- Verify: post-fix, gh api "repos/mmnto-ai/totem/code-scanning/analyses?per_page=100" shows a fresh totem-lint analysis.
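The serializer-side sanitization could look roughly like the sketch below. It is not the project's actual serializer: toWellFormedString and wellFormDeep are hypothetical helpers, the rule object is an illustrative shape, and walking every string in the rule metadata (rather than only pattern/description) is an assumption made for simplicity.

```javascript
// Replace unpaired surrogates with U+FFFD. Uses the built-in where available
// and falls back to an equivalent code-unit regex on older runtimes.
function toWellFormedString(value) {
  if (typeof value.toWellFormed === "function") return value.toWellFormed();
  return value.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    "\uFFFD"
  );
}

// Returns a copy of `value` with every string value made well-formed.
// Object keys are left untouched.
function wellFormDeep(value) {
  if (typeof value === "string") return toWellFormedString(value);
  if (Array.isArray(value)) return value.map(wellFormDeep);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [key, wellFormDeep(inner)])
    );
  }
  return value;
}

// Illustrative rule metadata; the id and pattern are made up.
const exampleRule = {
  id: "totem/example",
  properties: { pattern: "[\uD83C-\uD83E]", note: "keeps \uD83C\uDF89 intact" },
};

const sanitizedRule = wellFormDeep(exampleRule);
console.log(JSON.stringify(sanitizedRule));
```

After this pass JSON.stringify has no lone surrogates left to escape, so the re-serialized document contains no \ud8xx-style sequences, while correctly paired characters pass through unchanged.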
Relations
- … (the continue-on-error posture this failure hid behind)
🤖 Generated with Claude Code