tree-sitter-qmd - allow multi-line attribute lists on inline images/spans #209

Open
rundel wants to merge 5 commits into quarto-dev:main from rundel:bugfix/multiline-inline-attrs

rundel (Contributor) commented May 15, 2026

Not sure if this is even something you want to support or not - if the latter feel free to close this. Otherwise this is an attempt at a semi-minimally invasive fix for multi-line attributes.

Multi-line {...} attribute lists on inline images, spans, and other inline constructs that share _pandoc_attr_specifier were rejected at the first attribute. The same form rendered fine in TS Quarto and was common in quarto-web sources, leaving q2 as the lone holdout.

Root cause was twofold in tree-sitter-markdown/grammar.js:

  • attribute_specifier / _pandoc_attr_specifier had no tolerance for whitespace between { and the first specifier (or between _commonmark_specifier_start_with_class and } — also broke the single-line { .cls } form).
  • _inline_whitespace is choice($._whitespace, $._soft_line_break), a single token, so a \n followed by indent on the next line couldn't match as one inter-attribute separator.

Introduce _attr_ws: prec(-1, repeat1(choice($._whitespace, $._soft_line_break))) and use it where attribute lists need to span lines:

  • optional($._attr_ws) immediately after { in both attribute_specifier and _pandoc_attr_specifier.
  • Add trailing optional($._attr_ws) inside _commonmark_specifier_start_with_class (already present on _commonmark_specifier_start_with_kv). Keeping the trailer inside the specifier (rather than around the choice at the wrapper) avoids an LR conflict with language_specifier.
  • Swap inner _inline_whitespace → _attr_ws at the four sites that join successive classes, classes-to-kv, and successive kvs.

commonmark_specifier's leading optional($._inline_whitespace) is left untouched — changing it triggered the language_specifier conflict, and the wrapper-level _attr_ws already absorbs anything that would have leaked through.
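
The separator problem described above can be illustrated with a toy model in Rust. This is only a sketch, not the tree-sitter grammar itself: the `Tok` enum and the helper `class_follows` are hypothetical names invented for illustration, and real tokens carry far more context. It shows why a separator that accepts exactly one token fails on a soft line break followed by indentation, while a `repeat1`-style separator succeeds.

```rust
/// Toy token kinds standing in for the grammar's `_whitespace` and
/// `_soft_line_break` tokens (hypothetical; the real parser is generated C).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Tok {
    Whitespace,
    SoftLineBreak,
    Class, // e.g. `.cls`
    Close, // `}`
}

/// Old behaviour: `_inline_whitespace` accepts exactly one separator token.
pub fn skip_inline_whitespace(toks: &[Tok]) -> usize {
    match toks.first() {
        Some(Tok::Whitespace | Tok::SoftLineBreak) => 1,
        _ => 0,
    }
}

/// New behaviour: `_attr_ws` is `repeat1(choice(...))`, so it accepts a run
/// of one or more separator tokens.
pub fn skip_attr_ws(toks: &[Tok]) -> usize {
    toks.iter()
        .take_while(|t| matches!(t, Tok::Whitespace | Tok::SoftLineBreak))
        .count()
}

/// Hypothetical helper: is there a class directly after the separator?
pub fn class_follows(toks: &[Tok], skip: fn(&[Tok]) -> usize) -> bool {
    let n = skip(toks);
    n > 0 && toks.get(n) == Some(&Tok::Class)
}
```

With the token stream `[SoftLineBreak, Whitespace, Class, Close]` (a newline, then indent, then `.cls }`), `class_follows` returns false with `skip_inline_whitespace` and true with `skip_attr_ws`.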

Regenerated artifacts (parser.c, grammar.json, node-types.json) are produced by tree-sitter generate; do not hand-edit. Also regenerated crates/pampa/resources/error-corpus/_autogen-table.json via deno run -A scripts/build_error_table.ts because parser-state IDs shifted — without this regen, 7 apostrophe-quotes tests in qmd-syntax-helper fail (the rule keys off Q-2-10 diagnostics which map through (lr_state, sym) pairs).

Tests:

  • 6 new tree-sitter corpus cases in inline-multiline-attrs.txt cover multi-line class-only, multi-line class + key=value, { .cls } symmetry, span multi-line class, span multi-line kv, and # Heading { .cls } for the block-level path.
  • 1 new pampa integration test in test_attr_source_parsing.rs verifies that the AST and attr_source byte offsets are correct for the quarto-web-style multi-line form.

Full tree-sitter corpus: 492/492 pass (was 486). Full workspace: 8943/8943 pass. cargo xtask verify Rust legs all green.

End-to-end check:

  $ pampa repro.qmd
  [ Header 1 ( "test" , [] , [] ) [Str "Test"]
  , Para [Image ( "" , ["hero-banner", "img-fluid"]
                , [("fig-align", "center"), ("width", "600px")] )
           [] ("featured.png" , "")]
  , Para [Str "Done."] ]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cscheid (Member) commented May 15, 2026

I'm not against this fix, but anything involving line breaks in Markdown gives me a case of the heebie jeebies. I think it's prudent to add tests that exercise those spans inside bulleted lists and block quotes before declaring success.

(Operational annoyance: every time a grammar fix makes it in, we need to regenerate the error corpus table and the parser.c autogen file. There are instructions for Claude to do it in the repo, but you'll need to install Deno.)

rundel and others added 2 commits May 15, 2026 18:11
…e error

Adds edge-case coverage for the multi-line inline-attribute fix:

* tree-sitter corpus (`inline-multiline-attrs.txt`): six new cases for
  multi-line `{...}` attribute lists inside list items — bulleted,
  ordered, nested, image-in-list, plus two sibling-bullet variants
  (top-level and nested) that exercise list continuation across
  preceding and following items.
* pampa integration tests (`test_attr_source_parsing.rs`): five new
  AST-level tests mirroring the corpus cases — verifying that classes,
  key/value pairs, and `attr_source` byte offsets survive the
  block→inline boundary for each list shape.

Adds a new error code Q-2-37 for the one shape the grammar fix
cannot reach — multi-line attribute lists inside a blockquote. The
tree-sitter external scanner short-circuits SOFT_LINE_ENDING when the
next line begins with `>` (`scanner.c:2380-2407`), so the inline pass
only sees the first physical line of the attribute list and Q-2-2
fires at the same `(state=2587, sym="_close_block")` pair as a
plain top-level unclosed `{`. The two cases are indistinguishable at
the error-table lookup level but distinguishable in the source text.

* `resources/error-corpus/Q-2-37.json` — documents the new code with
  `cases: []` (no state mapping; this entry is emitted manually).
* `readers/qmd_error_messages.rs::upgrade_q22_to_q237_if_in_blockquote`
  — post-processes each Q-2-2 diagnostic: if the line of the failing
  `{` (after stripping leading whitespace) begins with `>`, rewrites
  `code`, `title`, `problem`, `hints`, and clears inherited
  `details` so the message reads cleanly without the Q-2-2 anchor
  note.
* `tests/test_q_2_37_blockquote_multiline_attrs.rs` — four tests:
  image-in-blockquote upgrades to Q-2-37; span-in-blockquote upgrades;
  blockquote with leading indent still upgrades; top-level `[attr]{[`
  stays Q-2-2 (negative control).

Full tree-sitter corpus: 495/495 (was 489 pre-fix, 493 after the
initial commit on this branch). Full workspace: 8952/8952.

End-to-end check on a real blockquote case:

  Error: [Q-2-37] Multi-line inline attribute list inside blockquote
   1 │ > ![](img.png){
     │                ╰── Inside a blockquote, an inline `{...}`
     │                    attribute list cannot span multiple lines.
  ℹ Put the attribute list on a single line, or move this construct
    out of the blockquote.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ne-attrs

# Conflicts:
#	crates/pampa/resources/error-corpus/Q-2-37.json
#	crates/pampa/resources/error-corpus/_autogen-table.json
#	crates/tree-sitter-qmd/tree-sitter-markdown/src/parser.c

rundel (Contributor, Author) commented May 15, 2026

Good call. Bulleted lists look like they work fine, but block quotes are a nightmare. For now I punted: a multi-line attribute list within a block quote just throws an error.

I'm not sure whether upgrading Q-2-2 to Q-2-38 based on context is acceptable.

Claude's summary is below.


Builds on the earlier multi-line attribute-list fix by adding edge-case coverage and a new diagnostic for the one shape the grammar fix can't reach.

List-context tests

Multi-line {...} attribute lists already work inside list items thanks to list-continuation handling — these tests lock that in against regression.

Tree-sitter corpus (inline-multiline-attrs.txt, 6 new cases):

  • Bulleted list item with multi-line span attrs.
  • Ordered list item with multi-line span attrs.
  • Nested bulleted list with multi-line span attrs in the inner item.
  • Bulleted list item with multi-line image attrs (class + key=value).
  • Top-level bullets with sibling items before and after the multi-line attrs at the same indent.
  • Nested bullets with sibling inner items before and after at the same indent.

Pampa integration tests (test_attr_source_parsing.rs, 5 new cases): AST-level mirrors of the above, verifying classes, key/value pairs, and attr_source byte offsets survive the block→inline boundary for each shape.

Q-2-38 — blockquote-specific error

Multi-line attribute lists inside a > blockquote don't work. The tree-sitter external scanner short-circuits SOFT_LINE_ENDING when the next line begins with > (scanner.c:2380-2407), so the inline parser sees only the first physical line, and Q-2-2 fires at the same (state=2587, sym="_close_block") pair as a top-level unclosed {. The two cases are indistinguishable at the lookup level but distinguishable in the source text.

qmd_error_messages.rs::upgrade_q22_to_q238_if_in_blockquote post-processes each Q-2-2 diagnostic: if the error line begins with > (after optional leading whitespace), the diagnostic is rewritten to Q-2-38:

Error: [Q-2-38] Multi-line inline attribute list inside blockquote
 1 │ > ![](img.png){
   │                ╰── Inside a blockquote, an inline `{...}` attribute list cannot span multiple lines.
ℹ Put the attribute list on a single line, or move this construct out of the blockquote.

  • resources/error-corpus/Q-2-38.json documents the new code (cases: [] — manually emitted, not state-mapped).
  • tests/test_q_2_38_blockquote_multiline_attrs.rs: 4 tests — image-in-bq, span-in-bq, blockquote with leading indent, and a top-level negative control confirming [attr]{[ still maps to Q-2-2.
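
The post-processing step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in qmd_error_messages.rs: the `Diagnostic` struct, its fields, and the function name `upgrade_if_in_blockquote` are hypothetical simplifications, and only the code, title, and details rewrite is modeled.

```rust
/// Hypothetical, simplified diagnostic; the real pampa type is not shown
/// in this thread and certainly carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Diagnostic {
    pub code: String,
    pub title: String,
    pub details: Vec<String>,
    /// 0-based index of the source line the error points at (assumed field).
    pub line: usize,
}

/// Rewrite a Q-2-2 diagnostic to Q-2-38 when the line it points at begins
/// with `>` after optional leading whitespace. Anything else is returned
/// unchanged.
pub fn upgrade_if_in_blockquote(mut diag: Diagnostic, source: &str) -> Diagnostic {
    if diag.code != "Q-2-2" {
        return diag;
    }
    let in_blockquote = source
        .lines()
        .nth(diag.line)
        .map(|l| l.trim_start().starts_with('>'))
        .unwrap_or(false);
    if in_blockquote {
        diag.code = "Q-2-38".to_string();
        diag.title = "Multi-line inline attribute list inside blockquote".to_string();
        // Drop notes inherited from Q-2-2 so the message reads cleanly.
        diag.details.clear();
    }
    diag
}
```

The check is purely textual, which is what makes it a contextual rewrite: the parser state is identical for both shapes, so only the source line can tell them apart.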

cscheid (Member) commented May 16, 2026

I think I want to leave this as an open issue until we can handle it uniformly. Creating these syntax exceptions is sort of opening the door for future trouble. I'd rather we reject those attributes uniformly even if it's annoying.

rundel (Contributor, Author) commented May 16, 2026

That sounds reasonable - I'll take a deeper look at the block quote variant

rundel and others added 2 commits May 16, 2026 12:49
`BLOCK_QUOTE.match` (scanner.c:568) consumes `>` plus at most one
optional space; any additional gutter alignment left on the
continuation line stalls the second SOFT_LINE_ENDING gate
(scanner.c:2536) at its `second_lookahead > ' '` check and forces a
paragraph-terminating LINE_ENDING. The inline `_attr_ws` rule cannot
consume LINE_ENDING, so a multi-space `{...}` continuation inside a
blockquote became a hard parse error (Q-2-38, added in 0ca1a40 as a
contextual rewrite over Q-2-2). `LIST_ITEM*.match` consumes the full
continuation indent intrinsically and `FENCED_DIV.match` doesn't
advance, so list/div contexts were unaffected — only blockquote
showed the asymmetry.

Scanner change (six lines): after `match_line` and before the
second-gate lookahead read, consume any leftover gutter whitespace.
Gated on the matched stack containing a `BLOCK_QUOTE` and on
`all_will_be_matched && might_be_soft_break` to skip partial
nested-blockquote matches and ATX contexts; the BLOCK_QUOTE check
ensures we never pre-consume indent at top level (which would regress
fenced-div / caption recognition).

Side-effect fix: regular blockquote paragraphs with multi-space gutter
alignment (`> foo\n>  bar`) now produce one Para with a SoftBreak
instead of two separate Paras.
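
The gutter behaviour can be modeled in a few lines of Rust. This is a simplified sketch of the logic as described in this message, not a port of scanner.c: all function names are hypothetical, leading-indent handling is reduced to a plain trim, and the `all_will_be_matched && might_be_soft_break` gating is left out.

```rust
/// Model of `BLOCK_QUOTE.match` as described above: consume `>` plus at
/// most one optional space and return the rest of the line.
/// Returns None when the line has no blockquote marker.
/// (Leading indent is simply trimmed here, which is a simplification.)
pub fn match_block_quote(line: &str) -> Option<&str> {
    let rest = line.trim_start().strip_prefix('>')?;
    Some(rest.strip_prefix(' ').unwrap_or(rest))
}

/// Model of the added step: drop any leftover gutter whitespace.
pub fn consume_gutter(rest: &str) -> &str {
    rest.trim_start_matches(|c: char| c == ' ' || c == '\t')
}

/// Model of the second gate: the next character must be greater than ' '.
pub fn passes_second_gate(rest: &str) -> bool {
    rest.chars().next().map_or(false, |c| c > ' ')
}

/// Does this continuation line keep the paragraph open (soft break)?
pub fn continues_paragraph(line: &str, with_fix: bool) -> bool {
    match match_block_quote(line) {
        Some(rest) if with_fix => passes_second_gate(consume_gutter(rest)),
        Some(rest) => passes_second_gate(rest),
        None => false,
    }
}
```

For the continuation line `>  bar`, the model leaves ` bar` after the marker, so the gate fails without the fix and passes with it; `> bar` passes either way.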

* `scanner.c` — six-line whitespace-consume loop with rationale.
* `inline-multiline-attrs.txt` — four corpus cases (image attrs, span
  attrs, nested blockquote, leading-indent blockquote). Total
  tree-sitter corpus: 501/501 (was 497 baseline).
* `resources/error-corpus/Q-2-38.json` — deleted.
* `readers/qmd_error_messages.rs` — drop
  `upgrade_q22_to_q238_if_in_blockquote` and its call site.
* `tests/test_blockquote_multiline_attrs.rs` (renamed from
  `test_q_2_38_blockquote_multiline_attrs.rs`) — invert assertions:
  the image/span/leading-indent cases now expect successful parse with
  the attribute_specifier shape; the top-level unclosed `{[` case
  keeps the Q-2-2 expectation and additionally asserts Q-2-38 never
  appears.

Verification: 8956/8956 Rust tests pass; tree-sitter test 501/501;
end-to-end pampa render confirms BlockQuote contains attributed
Image/Span/nested-Span in JSON output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Q-2-38 code was deleted earlier on this branch (e3b315b; the old
contextual rewrite over Q-2-2). Reviving it with new semantics that
mirror Q-2-1 "Unclosed Span": fires when the parser reaches the end of
the block while still inside an `{...}` attribute list, with an
anchored "starts here" note pointing back at the opening `{`.
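
As a rough illustration of what the "starts here" anchor refers to, the sketch below finds the byte offset of an opening `{` that is still unclosed at the end of a block. It is a hypothetical text-level helper written for this explanation only: the actual diagnostic is driven by LR states in the error table, not by scanning the source, and quote handling here is deliberately naive (double quotes, no escapes).

```rust
/// Return the byte offset of an opening `{` that is still unclosed when
/// `block` ends, or None if every `{` was closed.
/// Hypothetical helper; braces inside double-quoted values are ignored.
pub fn unclosed_attr_start(block: &str) -> Option<usize> {
    let mut open: Option<usize> = None;
    let mut in_quotes = false;
    for (i, c) in block.char_indices() {
        match c {
            // Only track quotes while inside an attribute list.
            '"' if open.is_some() => in_quotes = !in_quotes,
            '{' if !in_quotes && open.is_none() => open = Some(i),
            '}' if !in_quotes => open = None,
            _ => {}
        }
    }
    open
}
```

For `[t]{.cls` and for the multi-line `[t]{\n  .cls\n`, the helper points at offset 3, the position a "starts here" note would anchor to.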

Before this change, every unclosed-attr shape past the first
attribute (e.g. `[span]{.cls`, `[span]{\n  .cls\n`, image variants,
in-blockquote variants) fell through to the generic "Parse error /
unexpected character or token here" message with no anchor.

7 corpus cases cover the distinct LR states the parser actually visits
post-attribute:

| Case                        | LR state | Shape                              |
| --------------------------- | -------- | ---------------------------------- |
| eof-after-id                | 2818     | `[t]{#id`                          |
| eof-after-class             | 2746     | `[t]{.cls`                         |
| eof-after-kv                | 3069     | `[t]{k="v"`                        |
| multi-line-eof              | 2746     | `[t]{\n  .cls\n`                   |
| image-multi-line-eof        | 2746     | `![](i.png){\n  .cls\n`            |
| multi-line-eof-two-classes  | 3279     | `[t]{\n  .a\n  .b\n`               |
| multi-line-eof-class-and-kv | 3069     | `[t]{\n  .cls\n  k="v"\n`          |
| multi-line-eof-two-kvs      | 3069     | `[t]{\n  k1="v1"\n  k2="v2"\n`     |

New test file `test_unclosed_attr_specifier.rs` pins Q-2-38 emission
on each of these plus a blockquote-context case and a
list-item-terminated case. Two tests assert that the bare-`{` shapes
(`[t]{` and `![](i.png){`) resolve to Q-2-2 — they share LR state 2587
with the existing `{[` mismatched-delimiter case, so the parser
cannot distinguish them at that level.

The negative-control assertion in
`test_blockquote_multiline_attrs::toplevel_unclosed_attr_stays_q_2_2`
is updated: Q-2-38 now exists with new semantics, and the test
documents the Q-2-2 vs Q-2-38 boundary by asserting Q-2-38 does NOT
fire on the `{[` mismatched-delimiter input.

Open question for review: state 2587 means the parser can't
distinguish "bare `{` then EOF" from "`{[` mismatched delimiter".
Q-2-2 currently wins that pair with the message "Mismatched
Delimiter…", which is accurate for `{[` but slightly off for the bare
unclosed case. We may want to consolidate Q-2-2 and Q-2-38 into one
broader code, or relax Q-2-2's title to cover both shapes. Deferring
that decision to keep this PR focused on closing the diagnostic gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rundel (Contributor, Author) commented May 16, 2026

Changes above resolve the issue with block quotes being different - everything seems consistent now without breaking anything else obvious. Additional tests and error reporting added.

Claude's summary + outstanding issues


  • Q-2-38.json with 7 corpus cases mapping the distinct LR states the parser reaches after partial attribute content: {#id, {.cls, {k="v", multi-line with one class, image-multi-line, multi-line with two classes (state 3279 is its own slot), multi-line class+kv, multi-line two kvs.
  • test_unclosed_attr_specifier.rs pinning Q-2-38 emission on each of the above shapes plus an in-blockquote case and a * bullet [text]{.cls\n* next case (next list item closes the block while { is still open).
  • Two extra tests in the same file asserting that the bare { shapes ([text]{ and ![](img.png){) resolve to Q-2-2, not Q-2-38 — see below.
  • test_blockquote_multiline_attrs::toplevel_unclosed_attr_stays_q_2_2 updated to assert Q-2-38 does NOT fire on the {[ mismatched-delimiter input.

Q-2-2 ⇄ Q-2-38 boundary — open question for review

The LR table cannot distinguish the bare-{-then-EOF case from the {[ mismatched-delimiter case: both hit (state=2587, sym=_close_block). If both Q-2-2 and Q-2-38 were registered there, the diagnostic scorer would pick the highest-scoring entry, with ties going to the last-registered. I deliberately omitted the simple-eof case from Q-2-38 so that 2587 has only one entry and Q-2-2 always wins.

Net effect:

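| Input shape                         | LR state            | Diagnostic                                     |
| ----------------------------------- | ------------------- | ---------------------------------------------- |
| A bad [attr]{[                      | 2587                | Q-2-2 "Mismatched Delimiter"                   |
| [t]{ (bare unclosed)                | 2587                | Q-2-2 "Mismatched Delimiter" (slight misnomer) |
| [t]{.cls, [t]{#id, multi-line, etc. | 2746/2818/3069/3279 | Q-2-38 "Unclosed Attribute Specifier"          |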
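
The net effect can be pictured as a lookup from parser state to diagnostic code. The sketch below is a simplification under stated assumptions: the real table is keyed by (lr_state, sym) pairs and goes through a scorer, whereas this version keys by state alone because the thread names the symbol only for state 2587. The function names are hypothetical.

```rust
use std::collections::HashMap;

/// Build a state-to-code lookup using the LR states listed in the table.
/// Simplification: `sym` is omitted from the key.
pub fn build_table() -> HashMap<u32, &'static str> {
    let mut table = HashMap::new();
    // Bare `{` and `{[` share this state, so only one code can live here.
    table.insert(2587, "Q-2-2");
    // States reached after at least one attribute has been parsed.
    for state in [2746, 2818, 3069, 3279] {
        table.insert(state, "Q-2-38");
    }
    table
}

/// Look up the diagnostic code for the state the parser failed in.
pub fn diagnose(table: &HashMap<u32, &'static str>, lr_state: u32) -> Option<&'static str> {
    table.get(&lr_state).copied()
}
```

Because state 2587 maps to a single entry, nothing at this level can separate the bare-{ case from the {[ case; any distinction would have to come from a different key or from the source text.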

Two options for follow-up, neither blocking:

  1. Merge Q-2-2 and Q-2-38 into one broader code (e.g. "Unfinished Attribute Specifier") with a message that covers both shapes. Cleaner, but loses the title distinction Q-2-1 / Q-2-12 / etc. enjoy.
  2. Keep them split as today and rename Q-2-2's title to something more shape-agnostic ("Invalid Attribute Specifier"?) so the bare-{ case isn't visibly misnamed.
