-
-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Source mapping in AST #4565
Comments
I'm sure there is a pandoc issue about this somewhere, but all I could find was atom-community/markdown-preview-plus#106 and commonmark/commonmark-spec#57 |
Discussion in Google group https://groups.google.com/forum/#!msg/pandoc-discuss/tVe1RapDN5U/YGJE76FB74QJ Does not look very optimistic though... =) |
Adding source locations for everything would require a lot of
fundamental changes in the readers, and in the AST, so it's not at all a
trivial change.
Adding source locations for block-level elements would be a bit
easier; they could all be wrapped in Divs with source position
attributes. Not sure if this would be useful, though.
The pure Haskell commonmark parser I'm working on (jgm/commonmark-hs),
which may some day be the basis of pandoc's Markdown parser, tracks
detailed source positions for everything.
|
It would be useful. Example: editor and preview position synchronization for two-pane editors a-la https://dillinger.io/ -- any source position information would be useful for that. |
In the context of adding attributes to all elements, I think mpickering somewhere suggested to use polymorphic types like |
Hmmmm, not sure how much use the polymorphic types would see, my guess is they would only be used for a handful of things which would mean it's easier to just have some more concrete types. But I would be for having attributes attached to more of the AST. Also would like to see this done as an incremental change, so perhaps something like (1) add attributes for all I think though, it would be nice to turn on source attributes "globally" in some sense, without having to worry about "is it Markdown sources? which blocks do we do it for?". To that end, I can think of a few options:
I don't know thought, not sure we should halt all progress on this to find the perfect solution. Might be better to incrementally work towards it with something that fits the use cases of those who speak up. |
AST changes tend to be extremely painful for everyone, so generally 👍 for polymorphic types. "Incrementally" changing the AST doesn't sound like such a good idea. |
Well the incremental part wouldn't be changing the AST, just incrementally changing the parsers to actually insert the source position attributes. The two changes I proposed to the ASTs could be taken or left, either way you'll have to change the parsers at some point. The ideal situation would be to have a single place where we can throw the "turn on source position attributes", but I don't think the current architecture of the Readers has that common entry point for all readers. |
My point is, |
Yeah, I guess those were just a couple random ideas, I certainly don't think they are the solution to go with immediately. Sorry that I distracted the conversation with them. Basically, what I'm trying to get at is that perhaps we should go with a "good-enough" solution for the various use-cases people have asked about, and worry about a general solution as we see more use cases. To that end, if there are more changes that should be added to #4659 directly to satisfy more use-cases without performing a larger surgery on the reader framework/the ADTs, perhaps we should get them in now. It could very well be that people are happy with the "good-enough" for a while, giving more time for this discussion. |
Just thought I would chime in with another use case. I'm interested in using some grammar checking tools on documents in different markup formats. Feeding in the markup directly generates a lot of spurious errors. It will really help to convert to plain text with pandoc and then use a source mapping to know where in the original text the issue occurred, |
@michaelmior which constructs are you specifically interested in having source mapping for? Or just all constructs in the document? |
I'd like to be able to take a line and column number from plain text output and map it back to a line and column number in the input. Since it's pretty well-adopted, it would be ideal if pandoc could have the option to print out a source map in the same format used by web browsers to map from compiled/minified CSS and JS to the original source. |
This is a duplicate of #3809 but I guess this issue here has more info :) |
The following is copied from #5461, which was closed as a duplicate of this issue. I am including it here for the sake of discussion since it has some scenarios and an example implementation that should be considered in parallel with other suggestions here as this feature is actually developed. The ScenarioI'm tired of using Regular Expressions to try to parse Markdown. I run a publishing house where the canonical version of all our source material is stored in Markdown, and in the process of editing, translating, and publishing our materials there are a ton of operations that need to be done on the source files that require some level of contextual understanding. As our internal tooling is starting to look more and more like an IDE, more and more of our scripts are having to make guesses about lines in our source files. The most frustrating question to answer reliably is probably this one: Is line N inside a blockquote? a list item? a fenced code block? a div? a header? some nested combination of blocks? This question is surprisingly difficult to answer using general purpose stream editors and text manipulation tools. It isn't very hard to whip a RegEx that takes a wild guess, but it also isn't very hard to start running into edge cases where that guess is wildly off. Inline context is much easier to parse with RegEx, but still error prone. Is word Y inside emphasis markup? an inline span with a language attribute? The RequestImplement an optional Consider input file foo
> bar I would want to return something like this: $ pandoc --source-location -t json | jq
{
"blocks": [
{
"t": "Para",
"l": {
"l": 1,
"c": 1,
"b": 1
},
"c": [
{
"t": "Str",
"l": {
"l": 1,
"c": 1,
"b": 1
},
"c": "foo"
}
]
},
{
"t": "BlockQuote",
"l": {
"l": 3,
"c": 1,
"b": 6
},
"c": [
{
"t": "Para",
"l": {
"l": 3,
"c": 3,
"b": 8
},
"c": [
{
"t": "Str",
"l": {
"l": 3,
"c": 3,
"b": 8
},
"c": "bar"
}
]
}
]
}
],
"pandoc-api-version": [
1,
17,
5,
4
],
"meta": {}
} What could this be used for?My use case is for the Markdown reader, but I suspect any plain text input format would find benefits for this.
|
A couple problems for implementing this currently:
|
Thanks for the feedback @jgm.
|
@jgm: Perhaps the situation is more complicated than or different from what I understand in your comment. Based on what I do understand, I wonder whether you might just save the column index of the truncation for each line in the block before the recursive call to the parser. Then could you correct the results tree by applying the per-line column shift by the saved offset for that the particular line? More generally, could you not in other cases similarly save relevant information to reverse the effect of the transformation on the node information? |
Yes, it's not impossible. However, I don't feel like overhauling the markdown reader to add this. |
One more use case I'm investigating right now: extracting comments from the output document and matching them to lines in the source in markdown. In my case, that would solve two problems:
If there's a good first task for contributors or some relatively straightforward chore that could be chipped off the "integrating In the meantime, is this the right task to subscribe to be notified about commonmark-hs integration? |
commonmark-hs has already been integrated; it is now used in the As for source positions, changing one line in the commonmark reader would add Right (Cm bls :: Cm () Blocks) -> return $ B.doc bls would change to Right (Cm bls :: Cm SourcePos Blocks) -> return $ B.doc bls I think. I'll try it. Anyway, it would then just remain to add some mechanism for enabling and disabling this feature. |
Small mistake in the above. The real diff is: diff --git a/src/Text/Pandoc/Readers/CommonMark.hs b/src/Text/Pandoc/Readers/CommonMark.hs
index c1773eaab..a5e9f99c6 100644
--- a/src/Text/Pandoc/Readers/CommonMark.hs
+++ b/src/Text/Pandoc/Readers/CommonMark.hs
@@ -35,7 +35,7 @@ readCommonMark opts s = do
commonmarkWith (foldr ($) defaultSyntaxSpec exts) "input" s
case res of
Left err -> throwError $ PandocParsecError s err
- Right (Cm bls :: Cm () Blocks) -> return $ B.doc bls
+ Right (Cm bls :: Cm SourceRange Blocks) -> return $ B.doc bls
where After that change, we get
Note: because pandoc doesn't have a slot for attributes on every AST element, this requires the insertion of lots of Spans and Divs to hold the source position attributes. HTML output: <div data-pos="input@1:1-2:1">
<p><span data-pos="input@1:1-1:3">Hi</span><span data-pos="input@1:3-1:4"> </span><span data-pos="input@1:4-1:11"><em><span data-pos="input@1:5-1:10">there</span></em></span></p>
</div>
<div data-pos="input@3:1-4:1">
<blockquote>
<div data-pos="input@3:3-4:1">
<p><span data-pos="input@3:3-3:9">foobar</span></p>
</div>
</blockquote>
</div> |
Note also that even if you specify a file name, it will say |
The upshot is that we already have everything in place to add source position attributes to the AST for commonmark input. Perhaps a Note that the commonmark-hs package also has a module that allows generating a source map, separate from the AST. I will now look into the code changes that would be required to integrate that. |
External source map gives this sort of information:
That is, it tells you, for each location, which elements start (+) or end (-). diff --git a/src/Text/Pandoc/Readers/CommonMark.hs b/src/Text/Pandoc/Readers/CommonMark.hs
index c1773eaab..fcffba283 100644
--- a/src/Text/Pandoc/Readers/CommonMark.hs
+++ b/src/Text/Pandoc/Readers/CommonMark.hs
@@ -20,22 +20,36 @@ import Commonmark
import Commonmark.Extensions
import Commonmark.Pandoc
import Data.Text (Text)
-import Text.Pandoc.Class.PandocMonad (PandocMonad)
+import Text.Pandoc.Class.PandocMonad (PandocMonad, trace)
import Text.Pandoc.Definition
import Text.Pandoc.Builder as B
import Text.Pandoc.Options
import Text.Pandoc.Error
import Control.Monad.Except
import Data.Functor.Identity (runIdentity)
+import qualified Data.Text as T
+import qualified Data.Map as M
+import qualified Data.Sequence as Seq
-- | Parse a CommonMark formatted string into a 'Pandoc' structure.
readCommonMark :: PandocMonad m => ReaderOptions -> Text -> m Pandoc
readCommonMark opts s = do
- let res = runIdentity $
- commonmarkWith (foldr ($) defaultSyntaxSpec exts) "input" s
+ let res = runWithSourceMap <$> runIdentity
+ (commonmarkWith (foldr ($) defaultSyntaxSpec exts) "input" s)
case res of
Left err -> throwError $ PandocParsecError s err
- Right (Cm bls :: Cm () Blocks) -> return $ B.doc bls
+ Right ((Cm bls) :: Cm () Blocks, SourceMap sourceMap)
+ -> do
+ let renderStartEnd :: (Seq.Seq Text, Seq.Seq Text) -> Text
+ renderStartEnd (starts, ends) =
+ (foldMap (T.cons '+') starts)
+ <> (foldMap (T.cons '-') ends)
+ trace $
+ mconcat $ map (\(pos, startEnds) ->
+ T.pack (show (SourceRange [(pos,pos)])) <> T.pack " " <>
+ renderStartEnd startEnds <> T.pack "\n")
+ $ M.toList sourceMap
+ return $ B.doc bls
where
exts = [ (hardLineBreaksSpec <>) | isEnabled Ext_hard_line_breaks opts ] ++
[ (smartPunctuationSpec <>) | isEnabled Ext_smart opts ] ++ This emits the source map info to stdin if |
For anyone passing by this thread, this feature got released in 2.11.3 One thing to note:
To get a raw AST mapping, try I tried unzipping .docx rendered by pandoc and there's no trace of AST attributes there, so I take it .docx doesn't accept attributes. I can't think of a general solution to that, but if I do, I'll make sure to create a separate issue. @jgm Thanks for the good work! |
Great that this is implemented now for Markdown, but the issue title and description are not constrained to Markdown (OP even mentions different formats). So I'm not sure why this issue was closed, or why it's tagged |
The feature could in principle be implemented for other readers, but it would require significant manual modification, and in some cases, accurate source position information won't be possible due to parsing methods used. I'd prefer not to leave this open, as it's just too big an issue. More focused issues could be opened, e.g. "Add sourcepos extension support to the LaTeX reader", "Render sourcepos attributes in the DocBook writer", or "Give source positions a first class representation in the AST rather than using attributes." But I'd prefer to have issues that both address specific needs that users have and can reasonably be completed. |
I see that using |
Sorry, I think it is doing LaTeX locations. I just need to figure out how to interpret them... |
I installed pandoc 3.1.3 to use --from commonmark+sourcepos, unfortunately the ast/json differs from -from markdown for code blocks. The code block type is slightly wrong and the extra information is lost when position information is added. Given this markdown: with code block header {.xml c=1, d=2, UUID="viwjdjdjdj" } <?xml version="1.0" encoding="UTF-8"?>
<message>
<warning>
Hello World
</warning>
</message> the old json would look like :
while the new version looks like this:
|
Hi!
First of all, thank you for your great tool!
Being flexible and modular, it sometimes can be used in non-dedicated manner. For example, AST filters can be used to serialize AST somewhere, which allows just using Pandoc as markup language parser where needed.
For this purpose, is it difficult to add source file coordinates to AST or, maybe, to separate sourcemap-like file? For some formats, like M$ Word, they (coordinates within XML) will not be very informative, but for Human-readable formats it can be really great feature.
The text was updated successfully, but these errors were encountered: