Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistency? PAIR_HTML_COMMENT token not available in BLOCK_HTML #221

Open
DivineDominion opened this issue Jul 19, 2021 · 7 comments
Open

Comments

@DivineDominion
Copy link
Contributor

DivineDominion commented Jul 19, 2021

I noticed that when one uses HTML comments inside a regular paragraph, the token tree contains PAIR_HTML_COMMENT → HTML_COMMENT_START and eventually PAIR_HTML_COMMENT → HTML_COMMENT_STOP (the → denotes is-a-child)

This makes applying highlight colors to HTML comments a bit weird since the comment range is sometimes available as PAIR_HTML_COMMENT, sometimes not. Maybe something odd is underlying all this that you might want to know about @fletcher

Observations

When you put the HTML comment on its own line in an empty document or at the end of a document without a trailing newline, it's the same, e.g.:

BLOCK_PARA
    PAIR_HTML_COMMENT
        HTML_COMMENT_START
        TEXT_PLAIN ...
        HTML_COMMENT_STOP [after this is EOF]

But when the HTML comment spans the whole line and ends in a newline character, the token tree is:

BLOCK_HTML
    HTML_COMMENT_START
    TEXT_PLAIN ...
    HTML_COMMENT_STOP
    TEXT_NL
BLOCK_EMPTY  [here is EOF]

Note that this seems to be a special case, because the following in a Markdown document ...

<div>
<!-- html comment -->
</div>

... produces BLOCK_HTML with multiple LINE_HTML tokens, except for the line with the HTML comment; the tokens from that line stand alone. Here's some debug output with token types and the ranges (location + length):

{0,35}TokenType.blockHtml
  {0,6}TokenType.lineHtml             <div>
    {0,1}TokenType.angleLeft
    {1,3}TokenType.textPlain
    {4,1}TokenType.angleRight
    {5,1}TokenType.textNl
  {6,4}TokenType.htmlCommentStart     <!--
  {10,14}TokenType.textPlain          " html comment "
  {24,3}TokenType.htmlCommentStop     -->
  {27,1}TokenType.textNl
  {28,7}TokenType.lineHtml            </div>
    {28,1}TokenType.angleLeft
    {29,1}TokenType.slash
    {30,3}TokenType.textPlain
    {33,1}TokenType.angleRight
    {34,1}TokenType.textNl
{35,0}TokenType.blockEmpty

Exceptions from the rule?

Oddly enough, when you have a multi-line HTML comment:

<!--
foo
-->

This is not at all recognized as BLOCK_HTML, but BLOCK_PARA instead, and it includes a PAIR_HTML_COMMENT.

Plus if you add empty newlines around "foo" here, the whole HTML_COMMENT_PAIR is dissolved again and you have multiple paragraph blocks, one including the start, one the stop token.

@fletcher
Copy link
Owner

I would need to dig into some tests to address specific examples, but the main difference is that MultiMarkdown does not attempt to parse inside HTML blocks, as there is no point. HTML is copied verbatim and is not validated beyond that. HTML spans within Markdown blocks, which include HTML comments inside text paragraphs that would be matched as an HTML_COMMENT_PAIR. But an HTML comment inside an HTML block would just be treated as raw HTML and not processed any further.

Using MultiMarkdown as an HTML syntax highlighter will not be particularly useful. Basically you get HTML or HTML comment, and I would recommend both be styled the same to avoid confusion related to the issue you bring up.

(NOTE: HTML Blocks that include blank lines are treated differently, where the HTML is treated as separate blocks, and it is possible to have plain Markdown blocks inside the "wrapping" HTML. Such as:

<div>

<!-- comment -->

**foo**

</div>

)

@DivineDominion
Copy link
Contributor Author

In the end, in a syntax highlighting context, it does make sense to react to HTML_COMMENT_START by consuming all tokens that follow until the end of file of HTML_COMMENT_STOP. (In that regard, it's a bit different from what a MMD-to-HTML conversion would do. And the convention to allow mixed Markdown and HTML by adding empty lines does make total sense.)

I'm a bit surprised that the tokenizer treats

<!--
foo
-->

as BLOCK_PARA, and

<!-- foo -->

as BLOCK_HTML, but that's mostly irrelevant anyway. I don't have any particular request to push, it just looked a bit inconsistent at times and I didn't know if this is intentional or by accident.

@fletcher
Copy link
Owner

The first example requires parsing of multiple lines to determine whether this is an html comment, and whether there is additional content inside the paragraph. The second does not.

@DivineDominion
Copy link
Contributor Author

DivineDominion commented Jul 21, 2021

@fletcher Another data point:

<!-- test
```clang
opened fence 
-->
this is tokenized as fenced code block, too.

... the fencing inside the comment "block" affects all but the first line of text -- like any un-closed code fence does. I would expect the encompassing HTML comment to win, though, in the token level. What do you think? Is this a problem of the code that "applies" the token information instead?

@fletcher
Copy link
Owner

I'll have to dig in to see what it would take to change the parsing here. I agree that I would prefer this be a plain HTML comment, without a fenced block. Just need to see what the effort to change this would be.

@fletcher fletcher reopened this Jul 22, 2021
@DivineDominion
Copy link
Contributor Author

@fletcher After all this time, I can now imagine to try my hand at that and open a PR if you want to change things. Maybe it'd be useful if you first added some tests to prevent regressions on other complex cases, though? Your knowledge of edge cases in (M)MD parsing is unparalleled, so that's be helpful :)

@fletcher
Copy link
Owner

That's what the test suite is for.

I'm all for you taking a stab at this if you like, but this may be one of the more complicated parts of the parsing to try to modify if you're not familiar with it (MultiMarkdown, lemon, etc.).

I'm working on other MultiMarkdown software projects right now, so this is a bit further down the priority list for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants