Inconsistency? PAIR_HTML_COMMENT token not available in BLOCK_HTML #221

DivineDominion · 2021-07-19T11:04:47Z

I noticed that when one uses HTML comments inside a regular paragraph, the token tree contains PAIR_HTML_COMMENT → HTML_COMMENT_START and eventually PAIR_HTML_COMMENT → HTML_COMMENT_STOP (the → denotes is-a-child)

This makes applying highlight colors to HTML comments a bit weird since the comment range is sometimes available as PAIR_HTML_COMMENT, sometimes not. Maybe something odd is underlying all this that you might want to know about @fletcher

Observations

When you put the HTML comment on its own line in an empty document or at the end of a document without a trailing newline, it's the same, e.g.:

BLOCK_PARA
    PAIR_HTML_COMMENT
        HTML_COMMENT_START
        TEXT_PLAIN ...
        HTML_COMMENT_STOP [after this is EOF]

But when the HTML comment spans the whole line and ends in a newline character, the token tree is:

BLOCK_HTML
    HTML_COMMENT_START
    TEXT_PLAIN ...
    HTML_COMMENT_STOP
    TEXT_NL
BLOCK_EMPTY  [here is EOF]

Note that this seems to be a special case, because the following in a Markdown document ...

<div>
<!-- html comment -->
</div>

... produces BLOCK_HTML with multiple LINE_HTML tokens, except for the line with the HTML comment; the tokens from that line stand alone. Here's some debug output with token types and the ranges (location + length):

{0,35}TokenType.blockHtml
  {0,6}TokenType.lineHtml             <div>
    {0,1}TokenType.angleLeft
    {1,3}TokenType.textPlain
    {4,1}TokenType.angleRight
    {5,1}TokenType.textNl
  {6,4}TokenType.htmlCommentStart     <!--
  {10,14}TokenType.textPlain          " html comment "
  {24,3}TokenType.htmlCommentStop     -->
  {27,1}TokenType.textNl
  {28,7}TokenType.lineHtml            </div>
    {28,1}TokenType.angleLeft
    {29,1}TokenType.slash
    {30,3}TokenType.textPlain
    {33,1}TokenType.angleRight
    {34,1}TokenType.textNl
{35,0}TokenType.blockEmpty

Exceptions from the rule?

Oddly enough, when you have a multi-line HTML comment:

<!--
foo
-->

This is not at all recognized as BLOCK_HTML, but BLOCK_PARA instead, and it includes a PAIR_HTML_COMMENT.

Plus if you add empty newlines around "foo" here, the whole HTML_COMMENT_PAIR is dissolved again and you have multiple paragraph blocks, one including the start, one the stop token.

The text was updated successfully, but these errors were encountered:

fletcher · 2021-07-19T19:12:37Z

I would need to dig into some tests to address specific examples, but the main difference is that MultiMarkdown does not attempt to parse inside HTML blocks, as there is no point. HTML is copied verbatim and is not validated beyond that. HTML spans within Markdown blocks, which include HTML comments inside text paragraphs that would be matched as an HTML_COMMENT_PAIR. But an HTML comment inside an HTML block would just be treated as raw HTML and not processed any further.

Using MultiMarkdown as an HTML syntax highlighter will not be particularly useful. Basically you get HTML or HTML comment, and I would recommend both be styled the same to avoid confusion related to the issue you bring up.

(NOTE: HTML Blocks that include blank lines are treated differently, where the HTML is treated as separate blocks, and it is possible to have plain Markdown blocks inside the "wrapping" HTML. Such as:

<div>

<!-- comment -->

**foo**

</div>

)

DivineDominion · 2021-07-20T07:21:21Z

In the end, in a syntax highlighting context, it does make sense to react to HTML_COMMENT_START by consuming all tokens that follow until the end of file of HTML_COMMENT_STOP. (In that regard, it's a bit different from what a MMD-to-HTML conversion would do. And the convention to allow mixed Markdown and HTML by adding empty lines does make total sense.)

I'm a bit surprised that the tokenizer treats

<!--
foo
-->

as BLOCK_PARA, and

<!-- foo -->

as BLOCK_HTML, but that's mostly irrelevant anyway. I don't have any particular request to push, it just looked a bit inconsistent at times and I didn't know if this is intentional or by accident.

fletcher · 2021-07-20T13:04:58Z

The first example requires parsing of multiple lines to determine whether this is an html comment, and whether there is additional content inside the paragraph. The second does not.

DivineDominion · 2021-07-21T10:59:16Z

@fletcher Another data point:

<!-- test
```clang
opened fence 
-->
this is tokenized as fenced code block, too.

... the fencing inside the comment "block" affects all but the first line of text -- like any un-closed code fence does. I would expect the encompassing HTML comment to win, though, in the token level. What do you think? Is this a problem of the code that "applies" the token information instead?

fletcher · 2021-07-22T19:52:22Z

I'll have to dig in to see what it would take to change the parsing here. I agree that I would prefer this be a plain HTML comment, without a fenced block. Just need to see what the effort to change this would be.

DivineDominion · 2021-09-25T07:48:31Z

@fletcher After all this time, I can now imagine to try my hand at that and open a PR if you want to change things. Maybe it'd be useful if you first added some tests to prevent regressions on other complex cases, though? Your knowledge of edge cases in (M)MD parsing is unparalleled, so that's be helpful :)

fletcher · 2021-09-25T16:05:23Z

That's what the test suite is for.

I'm all for you taking a stab at this if you like, but this may be one of the more complicated parts of the parsing to try to modify if you're not familiar with it (MultiMarkdown, lemon, etc.).

I'm working on other MultiMarkdown software projects right now, so this is a bit further down the priority list for me.

fletcher closed this as completed Jul 20, 2021

fletcher reopened this Jul 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistency? PAIR_HTML_COMMENT token not available in BLOCK_HTML #221

Inconsistency? PAIR_HTML_COMMENT token not available in BLOCK_HTML #221

DivineDominion commented Jul 19, 2021 •

edited

Loading

fletcher commented Jul 19, 2021

DivineDominion commented Jul 20, 2021

fletcher commented Jul 20, 2021

DivineDominion commented Jul 21, 2021 •

edited

Loading

fletcher commented Jul 22, 2021

DivineDominion commented Sep 25, 2021

fletcher commented Sep 25, 2021

Inconsistency? PAIR_HTML_COMMENT token not available in BLOCK_HTML #221

Inconsistency? PAIR_HTML_COMMENT token not available in BLOCK_HTML #221

Comments

DivineDominion commented Jul 19, 2021 • edited Loading

Observations

Exceptions from the rule?

fletcher commented Jul 19, 2021

DivineDominion commented Jul 20, 2021

fletcher commented Jul 20, 2021

DivineDominion commented Jul 21, 2021 • edited Loading

fletcher commented Jul 22, 2021

DivineDominion commented Sep 25, 2021

fletcher commented Sep 25, 2021

DivineDominion commented Jul 19, 2021 •

edited

Loading

DivineDominion commented Jul 21, 2021 •

edited

Loading