Source mapping in AST #4565

Closed
dluciv opened this issue Apr 19, 2018 · 32 comments

Comments

@dluciv

dluciv commented Apr 19, 2018

Hi!

First of all, thank you for your great tool!

Being flexible and modular, it can sometimes be used for purposes other than document conversion. For example, AST filters can serialize the AST somewhere, which makes it possible to use Pandoc purely as a markup-language parser where needed.

To that end, how difficult would it be to add source file coordinates to the AST or, perhaps, to a separate sourcemap-like file? For some formats, like MS Word, such coordinates (positions within the XML) would not be very informative, but for human-readable formats this could be a really great feature.

@mb21
Collaborator

mb21 commented Apr 19, 2018

I'm sure there is a pandoc issue about this somewhere, but all I could find was atom-community/markdown-preview-plus#106 and commonmark/commonmark-spec#57

@dluciv
Author

dluciv commented Apr 19, 2018

Discussion in Google group https://groups.google.com/forum/#!msg/pandoc-discuss/tVe1RapDN5U/YGJE76FB74QJ

Does not look very optimistic though... =)

@jgm
Owner

jgm commented Apr 19, 2018 via email

@lierdakil
Contributor

Adding source locations for block-level elements would be a bit
easier; they could all be wrapped in Divs with source position
attributes. Not sure if this would be useful, though.

It would be useful. Example: editor and preview position synchronization for two-pane editors a-la https://dillinger.io/ -- any source position information would be useful for that.

@mb21
Collaborator

mb21 commented May 17, 2018

In the context of adding attributes to all elements, I think mpickering somewhere suggested using polymorphic types like Codeblock a String. I guess then you'd be freer to store the source map as any data type you want inside that a...

@ehildenb

Hmmmm, not sure how much use the polymorphic types would see, my guess is they would only be used for a handful of things which would mean it's easier to just have some more concrete types. But I would be for having attributes attached to more of the AST. Also would like to see this done as an incremental change, so perhaps something like (1) add attributes for all Block elements as initial step, then (2) add extension like #4659 which only adds source attributes in a few places when the filter is turned on, then (3) as people request source attributes for more places in the AST and for more readers, then go through and make it happen.

I think though, it would be nice to turn on source attributes "globally" in some sense, without having to worry about "is it Markdown sources? which blocks do we do it for?". To that end, I can think of a few options:

  1. Add a new Block type called Comment. Probably somewhat orthogonal, but then people can put whatever directives they want in the comments, and it helps with literate programming (so that we can keep source text as well, which is my original use-case).

  2. Add a new Block type BlockLoc SourcePos Block, which just wraps a Block with the given source location information. Once again incrementally update the parsers to insert these when parsing blocks, but may be possible to do once per reader language at a Block level parser (instead of once per reader language per block constructor).

I don't know, though; I'm not sure we should halt all progress on this to find the perfect solution. It might be better to work towards it incrementally with something that fits the use cases of those who speak up.

@lierdakil
Contributor

AST changes tend to be extremely painful for everyone, so generally 👍 for polymorphic types. "Incrementally" changing the AST doesn't sound like such a good idea.

@ehildenb

Well, the incremental part wouldn't be changing the AST, just incrementally changing the parsers to actually insert the source position attributes. The two changes I proposed to the AST could be taken or left; either way you'll have to change the parsers at some point. The ideal situation would be to have a single place where we can flip a "turn on source position attributes" switch, but I don't think the current architecture of the Readers has that common entry point for all readers.

@lierdakil
Contributor

My point is, BlockLoc is obviously not the be-all and end-all solution (e.g. what do we do if we want source map attributes on inline elements?). So that would have to be changed at some point, which is a pain. Comment is not related to the discussion at all. If it were easy to paste source positions into the source Markdown, I'd have done it ages ago via span/div. The sad truth is that one has to completely parse the source to do that.

@ehildenb

Yeah, I guess those were just a couple random ideas, I certainly don't think they are the solution to go with immediately. Sorry that I distracted the conversation with them.

Basically, what I'm trying to get at is that perhaps we should go with a "good-enough" solution for the various use-cases people have asked about, and worry about a general solution as we see more use cases. To that end, if there are more changes that should be added to #4659 directly to satisfy more use-cases without performing a larger surgery on the reader framework/the ADTs, perhaps we should get them in now. It could very well be that people are happy with the "good-enough" for a while, giving more time for this discussion.

@michaelmior

Just thought I would chime in with another use case. I'm interested in using some grammar checking tools on documents in different markup formats. Feeding in the markup directly generates a lot of spurious errors. It would really help to convert to plain text with pandoc and then use a source mapping to know where in the original text the issue occurred.

@ehildenb

@michaelmior which constructs are you specifically interested in having source mapping for? Or just all constructs in the document?

@michaelmior

I'd like to be able to take a line and column number from plain text output and map it back to a line and column number in the input. Since it's pretty well-adopted, it would be ideal if pandoc could have the option to print out a source map in the same format used by web browsers to map from compiled/minified CSS and JS to the original source.
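
A minimal sketch of the lookup described here, in Python. It assumes some earlier step has already produced a list of output segments, each paired with the source line and column where it starts; that pairing, and the names `build_index` and `to_source`, are hypothetical and not something pandoc emits in this form. It also works with character offsets into the output rather than line/column pairs, and assumes every segment is copied verbatim from a single source line.

```python
from bisect import bisect_right

def build_index(segments):
    """segments: list of (text, (src_line, src_col)) in output order.

    Returns (plain_text, starts, positions), where starts[i] is the
    output offset at which segment i begins.
    """
    parts, starts, positions = [], [], []
    offset = 0
    for text, pos in segments:
        starts.append(offset)
        positions.append(pos)
        parts.append(text)
        offset += len(text)
    return "".join(parts), starts, positions

def to_source(offset, starts, positions):
    """Map a 0-based output offset to a 1-based (line, col) in the source."""
    i = bisect_right(starts, offset) - 1   # segment containing the offset
    line, col = positions[i]
    return line, col + (offset - starts[i])

# Segments for the source line "Hi *there*" rendered as "Hi there".
segments = [("Hi", (1, 1)), (" ", (1, 3)), ("there", (1, 5))]
plain, starts, positions = build_index(segments)
```

Here `to_source(5, starts, positions)` points at the "e" of "there" in the source, skipping over the emphasis marker that the plain-text output dropped.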

@oli-obk

oli-obk commented Aug 8, 2018

This is a duplicate of #3809 but I guess this issue here has more info :)

@alerque
Contributor

alerque commented Apr 28, 2019

The following is copied from #5461, which was closed as a duplicate of this issue. I am including it here for the sake of discussion since it has some scenarios and an example implementation that should be considered in parallel with other suggestions here as this feature is actually developed.


The Scenario

I'm tired of using Regular Expressions to try to parse Markdown.

I run a publishing house where the canonical version of all our source material is stored in Markdown, and in the process of editing, translating, and publishing our materials there are a ton of operations that need to be done on the source files that require some level of contextual understanding. As our internal tooling is starting to look more and more like an IDE, more and more of our scripts are having to make guesses about lines in our source files. The most frustrating question to answer reliably is probably this one:

Is line N inside a blockquote? a list item? a fenced code block? a div? a header? some nested combination of blocks?

This question is surprisingly difficult to answer using general purpose stream editors and text manipulation tools. It isn't very hard to whip up a RegEx that takes a wild guess, but it also isn't very hard to start running into edge cases where that guess is wildly off.

Inline context is much easier to parse with RegEx, but still error prone.

Is word Y inside emphasis markup? an inline span with a language attribute?

The Request

Implement an optional --source-location flag or similar that keeps track of byte offsets while reading input and adds this to the AST tree. This would probably include line, column, and overall input byte offset.

Consider input file test.md:

foo

> bar

I would want to return something like this:

$ pandoc --source-location -t json test.md | jq
{
  "blocks": [
    {
      "t": "Para",
      "l": {
        "l": 1,
        "c": 1,
        "b": 1
      },
      "c": [
        {
          "t": "Str",
          "l": {
            "l": 1,
            "c": 1,
            "b": 1
          },
          "c": "foo"
        }
      ]
    },
    {
      "t": "BlockQuote",
      "l": {
        "l": 3,
        "c": 1,
        "b": 6
      },
      "c": [
        {
          "t": "Para",
          "l": {
            "l": 3,
            "c": 3,
            "b": 8
          },
          "c": [
            {
              "t": "Str",
              "l": {
                "l": 3,
                "c": 3,
                "b": 8
              },
              "c": "bar"
            }
          ]
        }
      ]
    }
  ],
  "pandoc-api-version": [
    1,
    17,
    5,
    4
  ],
  "meta": {}
}

What could this be used for?

My use case is for the Markdown reader, but I suspect any plain text input format would find benefits for this.

  • LaTeX→PDF has a tool called SyncTeX that generates PDFs that can be used in a "preview" type output scenario, where clicking on any element takes you back to the line in the source TeX file. This is a great boon for IDEs. I would like to do this with Markdown and other source document formats as well without reinventing the wheel.
    • I would like to build an EPUB reader on our website where people can report translation issues, or even be linked to an online editor with the source Markdown for that paragraph open, to contribute a fix.
    • I would like to print draft layouts that include a footer that shows the source file and line number range represented on each page so that proof readers can easily jump to the right context to make changes. Etc.
  • Syntax highlighters often struggle to get Pandoc's more advanced syntaxes sorted out and being able to check the context of a region by inspecting the AST tree would be a huge boon.
  • Linters could be implemented as pandoc filters, using Pandoc as the parser to get the context right, but returning output that can be used in the context of an editor.
  • Spell checkers
  • Natural language analysis tools (like grammar checkers or translation tools).
  • ...
  • This would also be a solution for "Better encoding error messages" #1417.

@jgm
Owner

jgm commented Apr 28, 2019

A couple problems for implementing this currently:

  1. You request byte offset. Pandoc has access to source line and column, but not byte offset (which of course depends on the encoding). Of course, byte offset could be computed if we had source line and column. Character offset might be more useful though in general.

  2. Pandoc's current markdown parsing strategy doesn't always allow us to give accurate source positions. For example, when we parse a block quote, we strip off the initial "> " or ">" from each line, then reparse the result. Since we may have stripped off different numbers of characters from each line (including 0 characters with "lazy" continuations), we've lost the information we need for accurate source positions. This can be fixed when we integrate commonmark-hs later, since it gives 100% accurate source positions.
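
To illustrate the first point, here is a small Python sketch of deriving a character offset, and from it a byte offset, out of a line and column pair. The function names are made up for illustration. It assumes 1-based lines and columns, "\n" line endings, and one column per character; a parser that expands tabs to tab stops would need extra handling.

```python
def char_offset(text, line, col):
    """0-based character offset of a 1-based (line, col) position."""
    lines = text.split("\n")
    # Each preceding line contributes its length plus one newline.
    return sum(len(l) + 1 for l in lines[: line - 1]) + (col - 1)

def byte_offset(text, line, col, encoding="utf-8"):
    """0-based byte offset; depends on the chosen encoding."""
    prefix = text[: char_offset(text, line, col)]
    return len(prefix.encode(encoding))
```

For the input "foo\n\n> bar", position 3:3 (the "b") gives a 0-based offset of 7 in both characters and bytes; the two values diverge as soon as a multi-byte character precedes the position.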

@alerque
Contributor

alerque commented Apr 29, 2019

Thanks for the feedback @jgm.

  1. A character offset would suffice. Given the encoding, the byte offset could be computed anyway; and realistically I'd make use of the character offset far more often and only suggested the byte offset so that could be computed.

    But it is interesting that you say the source line and column are accessible already. Even given the inaccuracies of the column value due to the strip/reparse cycle, the line offset alone would help identify what kind of block nesting a given inline is wrapped in, no? Is there a way to output this? (I'm using patched versions of Pandoc anyway, so I don't mind bolting something on pending progress on point 2.)

  2. I'm following commonmark-hs with considerable interest. I wish my Haskell chops weren't such weak sauce and I could help.

@fmoralesc

@jgm, @alerque For the needs of vim-pandoc-syntax even source line number would be a vast improvement.

@brainchild0

Pandoc's current markdown parsing strategy doesn't always allow us to give accurate source positions. For example, when we parse a block quote, we strip off the initial "> " or ">" from each line, then reparse the result. Since we may have stripped off different numbers of characters from each line (including 0 characters with "lazy" continuations), we've lost the information we need for accurate source positions.

@jgm: Perhaps the situation is more complicated than or different from what I understand in your comment. Based on what I do understand, I wonder whether you might just save the column index of the truncation for each line in the block before the recursive call to the parser. Then could you correct the results tree by shifting each position by the saved offset for that particular line? More generally, could you not in other cases similarly save relevant information to reverse the effect of the transformation on the node information?
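
The idea can be sketched in Python as follows. This is only an illustration of the bookkeeping, not how pandoc's Markdown reader works: the marker pattern is a simplification of the real block quote rule (it ignores tabs, for instance), and `strip_quote_markers` and `remap` are hypothetical names.

```python
import re

# Up to three spaces of indentation, ">", and one optional space.
MARKER = re.compile(r" {0,3}> ?")

def strip_quote_markers(lines):
    """Return (stripped_lines, shifts).

    shifts[i] is the number of characters removed from line i; it is 0
    for a "lazy" continuation line that has no marker at all.
    """
    stripped, shifts = [], []
    for line in lines:
        m = MARKER.match(line)
        n = m.end() if m else 0
        stripped.append(line[n:])
        shifts.append(n)
    return stripped, shifts

def remap(pos, first_line, shifts):
    """Map a 1-based (line, col) in the stripped text back to the original.

    first_line is the original line number of the block quote's first line.
    """
    line, col = pos
    return first_line + line - 1, col + shifts[line - 1]
```

After reparsing the stripped lines, every position in the resulting subtree would be passed through `remap` with the shifts saved for that block quote; nested block quotes would compose one such correction per level.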

@jgm
Owner

jgm commented May 14, 2020

Yes, it's not impossible.

However, I don't feel like overhauling the markdown reader to add this.
The commonmark parser in commonmark-hs, which I'll be integrating into pandoc, already has complete source position information.

@v4dkou

v4dkou commented Dec 9, 2020

One more use case I'm investigating right now: extracting comments from the output document and matching them to lines in the source in markdown.

In my case, that would solve two problems:

  1. Working on docx comments, that can come from Word/Google Docs, when we work with external personnel (i.e. lawyers) or customers who are not familiar with Markdown-based workflows.
  2. Working on comments that come from HTML/JIRA/etc. targets, when we work with internal personnel. Although we can probably teach people to leave comments in the source, instead of right where they usually read the document, I feel like this extra step will discourage people from leaving minor, but cumulatively substantial comments.

If there's a good first task for contributors or some relatively straightforward chore that could be chipped off the "integrating commonmark-hs and implement source mapping" story, I would gladly pick it up.

In the meantime, is this the right issue to subscribe to in order to be notified about commonmark-hs integration?
#4535

@jgm
Owner

jgm commented Dec 9, 2020

commonmark-hs has already been integrated; it is now used in the commonmark reader.
A good number of pandoc extensions have been written for commonmark-hs, but we're still missing some key ones (citations, some of the table formats, example lists). Unfortunately these are also the trickiest to implement.

As for source positions, changing one line in the commonmark reader would add data-source-pos attributes to all elements.

    Right (Cm bls :: Cm () Blocks) -> return $ B.doc bls

would change to

    Right (Cm bls :: Cm SourcePos Blocks) -> return $ B.doc bls

I think. I'll try it. Anyway, it would then just remain to add some mechanism for enabling and disabling this feature.

@jgm
Owner

jgm commented Dec 9, 2020

Small mistake in the above. The real diff is:

diff --git a/src/Text/Pandoc/Readers/CommonMark.hs b/src/Text/Pandoc/Readers/CommonMark.hs
index c1773eaab..a5e9f99c6 100644
--- a/src/Text/Pandoc/Readers/CommonMark.hs
+++ b/src/Text/Pandoc/Readers/CommonMark.hs
@@ -35,7 +35,7 @@ readCommonMark opts s = do
               commonmarkWith (foldr ($) defaultSyntaxSpec exts) "input" s
   case res of
     Left err -> throwError $ PandocParsecError s err
-    Right (Cm bls :: Cm () Blocks) -> return $ B.doc bls
+    Right (Cm bls :: Cm SourceRange Blocks) -> return $ B.doc bls
  where

After that change, we get

% stack exec pandoc -- -t native -f commonmark
Hi *there*

> foobar
^D
[Div ("",[],[("data-pos","input@1:1-2:1")])
 [Para [Span ("",[],[("data-pos","input@1:1-1:3")]) [Str "Hi"],Span ("",[],[("data-pos","input@1:3-1:4")]) [Space],Span ("",[],[("data-pos","input@1:4-1:11")]) [Emph [Span ("",[],[("data-pos","input@1:5-1:10")]) [Str "there"]]]]]
,Div ("",[],[("data-pos","input@3:1-4:1")])
 [BlockQuote
  [Div ("",[],[("data-pos","input@3:3-4:1")])
   [Para [Span ("",[],[("data-pos","input@3:3-3:9")]) [Str "foobar"]]]]]]

Note: because pandoc doesn't have a slot for attributes on every AST element, this requires the insertion of lots of Spans and Divs to hold the source position attributes. HTML output:

<div data-pos="input@1:1-2:1">
<p><span data-pos="input@1:1-1:3">Hi</span><span data-pos="input@1:3-1:4"> </span><span data-pos="input@1:4-1:11"><em><span data-pos="input@1:5-1:10">there</span></em></span></p>
</div>
<div data-pos="input@3:1-4:1">
<blockquote>
<div data-pos="input@3:3-4:1">
<p><span data-pos="input@3:3-3:9">foobar</span></p>
</div>
</blockquote>
</div>
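
A consumer of this output needs to pick the attribute values apart. Below is a hedged Python sketch that handles only the single-range shape seen above (name, "@", start line:col, "-", end line:col); `Range`, `parse_data_pos`, and `covers_line` are illustrative names. Judging from the sample, the end position looks exclusive (a paragraph on line 1 ends at 2:1), and `covers_line` relies on that assumption.

```python
import re
from typing import NamedTuple, Tuple

class Range(NamedTuple):
    source: str
    start: Tuple[int, int]   # (line, col), 1-based
    end: Tuple[int, int]     # (line, col), assumed exclusive

_SPAN = re.compile(r"(\d+):(\d+)-(\d+):(\d+)")

def parse_data_pos(value):
    """Parse a value such as 'input@3:3-4:1' into a Range."""
    # Split on the last "@" in case the source name itself contains one.
    source, _, span = value.rpartition("@")
    m = _SPAN.fullmatch(span)
    if m is None:
        raise ValueError(f"unrecognized data-pos value: {value!r}")
    l1, c1, l2, c2 = map(int, m.groups())
    return Range(source, (l1, c1), (l2, c2))

def covers_line(r, line):
    """True if the range includes any character of the given line."""
    (l1, _), (l2, c2) = r.start, r.end
    last = l2 if c2 > 1 else l2 - 1   # an end at column 1 excludes that line
    return l1 <= line <= last
```

With this, `covers_line(parse_data_pos("input@3:3-4:1"), 3)` is true while line 4 is reported as outside the block.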

@jgm
Owner

jgm commented Dec 9, 2020

Note also that even if you specify a file name, it will say input@ because of the way pandoc concatenates the inputs. Changing that would require modifying the readers to take a list of (FilePath, Text) pairs instead of just a Text as input. Kind of a big project.

@jgm
Owner

jgm commented Dec 9, 2020

The upshot is that we already have everything in place to add source position attributes to the AST for commonmark input. Perhaps a --sourcepos option should be added.

Note that the commonmark-hs package also has a module that allows generating a source map, separate from the AST. I will now look into the code changes that would be required to integrate that.

@jgm
Owner

jgm commented Dec 9, 2020

External source map gives this sort of information:

[trace] input@1:1-1:1 +rawBlock
input@5:1-5:1 -rawBlock
input@6:1-6:1 +heading1
input@6:3-6:3 +str
input@6:9-6:9 -str
input@7:1-7:1 -heading1
input@8:1-8:1 +paragraph+link
input@8:2-8:2 +image
input@8:4-8:4 +str
input@8:10-8:10 -str
input@9:1-9:1 +str
input@9:8-9:8 -str
input@9:85-9:85 -image
input@9:126-9:126 -link
input@10:1-10:1 +link
input@10:2-10:2 +image
input@10:4-10:4 +str

That is, it tells you, for each location, which elements start (+) or end (-).
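
As a rough illustration of how such a listing could be consumed, the Python sketch below replays the start and end events to reconstruct which elements are open after each position. It assumes the line shape shown above, that element names contain neither "+" nor "-", and that ends at a position should be applied before starts; `open_elements` is a made-up name.

```python
import re

LINE = re.compile(r"(?:\[trace\] )?(\S+)@(\d+):(\d+)-\d+:\d+ (\S+)")
EVENT = re.compile(r"([+-])([^+-]+)")

def open_elements(trace_lines):
    """Return a list of ((line, col), open_elements) pairs.

    open_elements is a tuple of the element names still open after the
    events at that position have been applied, outermost first.
    """
    stack, result = [], []
    for raw in trace_lines:
        m = LINE.match(raw.strip())
        if m is None:
            continue                      # skip anything unrecognized
        pos = (int(m.group(2)), int(m.group(3)))
        events = EVENT.findall(m.group(4))
        # Close elements first, removing the innermost match by name.
        for sign, name in events:
            if sign == "-":
                for i in range(len(stack) - 1, -1, -1):
                    if stack[i] == name:
                        del stack[i]
                        break
        # Then open the elements that start at this position.
        stack.extend(name for sign, name in events if sign == "+")
        result.append((pos, tuple(stack)))
    return result
```

To answer "what is open at line N?", one would take the last entry whose position does not exceed the position of interest.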
Proof of concept diff:

diff --git a/src/Text/Pandoc/Readers/CommonMark.hs b/src/Text/Pandoc/Readers/CommonMark.hs
index c1773eaab..fcffba283 100644
--- a/src/Text/Pandoc/Readers/CommonMark.hs
+++ b/src/Text/Pandoc/Readers/CommonMark.hs
@@ -20,22 +20,36 @@ import Commonmark
 import Commonmark.Extensions
 import Commonmark.Pandoc
 import Data.Text (Text)
-import Text.Pandoc.Class.PandocMonad (PandocMonad)
+import Text.Pandoc.Class.PandocMonad (PandocMonad, trace)
 import Text.Pandoc.Definition
 import Text.Pandoc.Builder as B
 import Text.Pandoc.Options
 import Text.Pandoc.Error
 import Control.Monad.Except
 import Data.Functor.Identity (runIdentity)
+import qualified Data.Text as T
+import qualified Data.Map as M
+import qualified Data.Sequence as Seq
 
 -- | Parse a CommonMark formatted string into a 'Pandoc' structure.
 readCommonMark :: PandocMonad m => ReaderOptions -> Text -> m Pandoc
 readCommonMark opts s = do
-  let res = runIdentity $
-              commonmarkWith (foldr ($) defaultSyntaxSpec exts) "input" s
+  let res = runWithSourceMap <$> runIdentity
+              (commonmarkWith (foldr ($) defaultSyntaxSpec exts) "input" s)
   case res of
     Left err -> throwError $ PandocParsecError s err
-    Right (Cm bls :: Cm () Blocks) -> return $ B.doc bls
+    Right ((Cm bls) :: Cm () Blocks, SourceMap sourceMap)
+             -> do
+                 let renderStartEnd :: (Seq.Seq Text, Seq.Seq Text) -> Text
+                     renderStartEnd (starts, ends) =
+                       (foldMap (T.cons '+') starts)
+                       <> (foldMap (T.cons '-') ends)
+                 trace $
+                   mconcat $ map (\(pos, startEnds) ->
+                     T.pack (show (SourceRange [(pos,pos)])) <> T.pack " " <>
+                           renderStartEnd startEnds <> T.pack "\n")
+                   $ M.toList sourceMap
+                 return $ B.doc bls
  where
   exts = [ (hardLineBreaksSpec <>) | isEnabled Ext_hard_line_breaks opts ] ++
          [ (smartPunctuationSpec <>) | isEnabled Ext_smart opts ] ++

This emits the source map info to stderr if --trace is used. I note that auto_identifiers seems to break when this is done; that needs looking into and may be a bug in commonmark-extensions.

@v4dkou

v4dkou commented Dec 22, 2020

For anyone passing by this thread, this feature got released in 2.11.3

One thing to note:

The data-pos attributes are put on elements that accept attributes

To get a raw AST mapping, try
pandoc --from commonmark+sourcepos --to json -i test.md -o test.json
Source mapping attributes are also present in HTML
pandoc --from commonmark+sourcepos --to html -i test.md -o test.html
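
A short Python sketch of reading positions back out of the JSON form. It assumes the usual pandoc JSON encoding of Div and Span, with "c" holding the attribute triple followed by the children, and reports each data-pos value together with the types of the elements it wraps. The function name `positioned` and the hand-written `doc` are illustrative; `doc` mirrors the block quote example earlier in the thread and is not real pandoc output.

```python
def positioned(node):
    """Yield (data_pos, wrapped_types) for every Div/Span carrying data-pos."""
    if isinstance(node, dict):
        if node.get("t") in ("Div", "Span"):
            attr, children = node["c"]        # attr = [id, classes, key-values]
            pos = dict(attr[2]).get("data-pos")
            if pos is not None:
                yield pos, [child["t"] for child in children]
        for value in node.values():
            yield from positioned(value)
    elif isinstance(node, list):
        for item in node:
            yield from positioned(item)

def wrap(tag, pos, children):
    """Build a Div or Span node holding a data-pos attribute."""
    return {"t": tag, "c": [["", [], [["data-pos", pos]]], children]}

doc = {"blocks": [
    wrap("Div", "input@3:1-4:1", [
        {"t": "BlockQuote", "c": [
            wrap("Div", "input@3:3-4:1", [
                {"t": "Para", "c": [
                    wrap("Span", "input@3:3-3:9",
                         [{"t": "Str", "c": "foobar"}]),
                ]},
            ]),
        ]},
    ]),
]}

found = list(positioned(doc))
```

In practice `doc` would come from `json.load` on the test.json file produced by the first command above.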

I tried unzipping .docx rendered by pandoc and there's no trace of AST attributes there, so I take it .docx doesn't accept attributes. I can't think of a general solution to that, but if I do, I'll make sure to create a separate issue.

@jgm Thanks for the good work!

@valentjn

Great that this is implemented now for Markdown, but the issue title and description are not constrained to Markdown (OP even mentions different formats). So I'm not sure why this issue was closed, or why it's tagged format:Markdown.

@jgm
Owner

jgm commented Dec 22, 2020

The feature could in principle be implemented for other readers, but it would require significant manual modification, and in some cases, accurate source position information won't be possible due to parsing methods used. I'd prefer not to leave this open, as it's just too big an issue. More focused issues could be opened, e.g. "Add sourcepos extension support to the LaTeX reader", "Render sourcepos attributes in the DocBook writer", or "Give source positions a first class representation in the AST rather than using attributes." But I'd prefer to have issues that both address specific needs that users have and can reasonably be completed.

@dmurdoch

I see that using --from commonmark+sourcepos with --to latex changes the rendering (it adds braces around every word; that's helpful for Synctex), but I can't see how to get the source positions that correspond to particular locations. Is there some way to use --to json to give LaTeX locations instead of HTML locations?

@dmurdoch

Sorry, I think it is doing LaTeX locations. I just need to figure out how to interpret them...

@nkh

nkh commented Jun 7, 2023

@jgm

I installed pandoc 3.1.3 to use --from commonmark+sourcepos. Unfortunately, the AST/JSON differs from what --from markdown produces for code blocks.

The code block type is slightly wrong and the extra information is lost when position information is added.

Given this Markdown, with the code block header {.xml c=1, d=2, UUID="viwjdjdjdj" }:

<?xml version="1.0" encoding="UTF-8"?>
<message>
    <warning>
            Hello World
    </warning>
</message>

the old JSON would look like:

{
  "t": "CodeBlock",
  "c": [
    [
      "",
      [
        "xml"
      ],
      [
        [
          "c",
          "1,"
        ],
        [
          "d",
          "2,"
        ],
        [
          "UUID",
          "viwjdjdjdj"
        ]
      ]
    ],
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<message>\n    <warning>\n            Hello World\n    </warning>\n</message>"
  ]
},

while the new version looks like this:

{
  "t": "CodeBlock",
  "c": [
    [
      "",
      [
        "{.xml"
      ],
      [
        [
          "data-pos",
          "pasithee.md@299:1-307:1"
        ]
      ]
    ],
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<message>\n    <warning>\n            Hello World\n    </warning>\n</message>"
  ]
},
