Add Custom Markdown Parser Block and Inline Injection #3918
Replies: 34 comments 1 reply
-
No response from the community. ping @nmetulev @Odonno @IbraheemOsama |
Beta Was this translation helpful? Give feedback.
-
This issue seems inactive. Do you need help to complete this issue? |
Beta Was this translation helpful? Give feedback.
-
Can we close it? The PR is merged and the related issue is closed. |
Beta Was this translation helpful? Give feedback.
-
@WilliamABradley can you provide an example using this? |
Beta Was this translation helpful? Give feedback.
-
From what I see, that this issue wasn't handled in the referenced PR. |
Beta Was this translation helpful? Give feedback.
-
I need this and am happy to have a crack at writing it if no one is making progress on it? |
Beta Was this translation helpful? Give feedback.
-
This would be what I'm currently searching for. |
Beta Was this translation helpful? Give feedback.
-
@WilliamABradley seems like you haven't done much on this lately? Did you get started at all and have some code to share in a branch? @knightmeister @LokiMidgard if either of you or both would want to take a poke at this, please feel free to jump in! The Markdown control is pretty popular, so I know this would be a welcome edition. |
Beta Was this translation helpful? Give feedback.
-
@michael-hawker I'm not sure if I'll come to this soon. However I've made some thoughts about this feature I would like to share. ScopeWhat can be extended Markdown has multiple ways to declare different kind of elements.
I only looked shortly at the source code, But if I understood it correctly Inline parsing is very modular designed. Every InlineElement seems to define characters that may start the inline element. And when a character matches the parse logic is in the inline element itself (which can fail/return If there was code that would dynamically compose those different InlineElements, new custom inline elements could added quite easy I think. With only the problem of ordering. This would allow The only anomaly is parsing links. The ParseInlineChildren method has an parameter to suppress parsing links. This is only set by Would we need a more generic way, so when parsing inline's the caller can define which inline's should not be considered? For blocks this is not the case. The Logic if something could be a Block is in the Document parse method. However this should be changeable. Blocks and inline's have a Do we need an additional Property to specify the Type if Three types of blocks need special consideration since they result in little different behavior then the rest of the blocks. Quotes, underline style header and (YAML) header. Can custom blocks styles have the same or similar behavior then those build in styles? The underline style header will change the already parsed paragraph. The last line will be used as the Block and the text before that line will become a paragraph block. Should we allow an underline style to manipulate more then the last line?
Block Quotes can be nested inside another and the Document parse logic will first determent the indentation of the quotes. And find out when the quote finishes. Some blocks also need the indention to work correctly. Can different quote blocks be nested?
YAML header block is handled differently. And I must say I haven't figured out why. Normally we try to parse one block type after another. This is done in a loop
After we found a block we fall through to the end putting the block in the collection and possible emitting an additional paragraph block for those text we got before but is not part of the current block. And then we go back to the start of the loop. The YAML header is the first block that is checked in the loop. If it matches we delete all previously text that is not yet part of a block. (Normally that would become a paragraph.) Then the Did I miss something with YAML? Otherwise we don't need a special handling. Every Block that could not be parsed as any block will default to Paragraph. (I think that's the only way to get a paragraph block) Should we allow to define what block will be used as default? Should the out of the box provided blocks and inline's use the same mechanic as the custom ones? Should those be removable? API Surface@WilliamABradley proposed using events that are called when parsing happens. I think I would tend to use Interfaces or Abstract classes that have Methods to Parse and return an Inline or Block. I think this would make it more straight forward to extend this Parser with Library's. Looking at the public abstract AbstractInlineFactory {
public abstract InlineParseResult Parse(string markdown, int start, int maxEnd);
/// <summary>
/// Returns the chars that if found means we might have a match.
/// </summary>
protected abstract IEnumerable<char> GetTripChars();
public void AddTripChars(List<InlineTripCharHelper> tripCharHelpers){
foreach(var c in this.GetTripChars())
tripCharHelpers.Add(new InlineTripCharHelper() { FirstChar = c, Method = this });
}
public virtual IEnumerable<Type> DefaultBeforeFactorys => Array.Empty<Type>();
public virtual IEnumerable<Type> DefaultAfterFactorys => Array.Empty<Type>();
} The two additional methods For Blocks it looks similar. public abstract AbstractBlockFactory<T>
where T : MarkdownBlock {
/// <summary>
/// Parses a block.
/// </summary>
/// <param name="markdown"> The markdown text. </param>
/// <param name="startOfLine"> The location of the start of the line. </param>
/// <param name="maxEnd"> The location to stop parsing. </param>
/// <param name="quoteDepth"> The current nesting of quotes. </param>
/// <param name="actualEnd"> Set to the end of the block when the return value is non-null. </param>
/// <returns> A parsed block. Or <c>null</c> if the text can't parsed to this <c>T</c></returns>
public abstract T Parse(string markdown, int startOfLine, int maxEnd, IEnumerable<char> quoteDepth, out int actualEnd);
public abstract bool Consider(bool lineStartsNewParagraph, char firstChar, int firstCharPos, int startOfLine);
/// <summary>
/// If <c>true</c> and Consider is also <c>true</c> Parse will get the previous line.
/// </summary>
public virtual bool IsUnderlineStyle => false;
public virtual IEnumerable<Type> DefaultBeforeFactorys => Array.Empty<Type>();
public virtual IEnumerable<Type> DefaultAfterFactorys => Array.Empty<Type>();
} The logic if a block should be tried to parse, that is currently in the Document would go into the I'm not sure if the TripChars are actually needed, we could consider returning The Parse could be configured like following var document = MarkdownDocument.BuildCustom()
.AddParser<AbbreviationInlineFactory>()
.Before<StrikethroughInlineFactory>()
.Before<CommentInlineFactory>()
.After<CodeInlineFactory>()
.AddParser<SampleBlockFactory>()
.RemoveParser<EmojiInlineFactory>()
.CreateDocument();
document.Parse(/*...*/); Using the Factory's as Generic argument will not allow to configure the Factory using the constructor. The idea is that a user will not write a flexible factory that he want to use multiple times with different configurations. Not using instances but generic types allows to remove factory's without having the correct instance. The Before and After Methods will influence the execution order. I'm not sure if those should replace the default ones defined by the factory or add to those. There should also be a check if the ordering is valid. If I also think the factory's should be supplied in the constructor or as in the sample with a builder pattern. Changing the Factory's after parsing seems strange. Instantiating the RendererI haven't made much thoughts about rendering yet. This is all thoughts until know. So there are probably some things that will not work as I currently thinking. |
Beta Was this translation helpful? Give feedback.
-
Thanks @LokiMidgard, appreciate the detailed breakdown and thought you've put into this. I especially like the custom parser factories being able to be injected within the specific order context. |
Beta Was this translation helpful? Give feedback.
-
I made some thoughts about those questions I wrote down. Here is my opinion on theses
I think yes. The parse method can get a enumeration of factorys that should be ignored. This would be used by the Link parsing to not parse links inside the brackets
I don't think so. Enum is nice, because you can check if your switch will handle all cases. But with having an string property that has the key I don't see any benefit over using the Type (class). I would actially Obsoleting the
I think it would be best to actually write all the default parsing in factorys and only use factorys while parsing. Bothe parse methods (for inline and blocks) would be instance methods on
Why not. This should not be hard. I think instead of the usage of the proposed
It is propably a litle more complex (then the already complex) quote logic. But unless its to hard to write we should do this. From Api the API ussage perspactive I think it would make sense to support it.
I don't think header need special handling as long we always pass the complete markdown string with start and end index.
I don't think its nessesary. But would be interrested if someone had a usecase.
yes! (I think I already said this earlier)
Yes. Build in ones may conflict with custom ones and could be replaced with an slight different implementation that does not make problems.
I currently think returning null would be enough. I don't know how much performance we lose when using more method calls. And how performance criticle this libary is. If performance is a concern maybe supporting incremental parsing would be a better Idea. But this would be a different issue. |
Beta Was this translation helpful? Give feedback.
-
I'm getting the impression that there must be some kind of configuration per factory. Eventually using an optional lambda for configuration: public DocumentBuilderConfigurator<TFactory> AddParser<TFactory>(Action<TFactory> configurationCallback = null)
where TFactory : AbstractBlockFactory, new() I hope that this API design will still emphasize that every Type should only be used once. And it is not allowed to use the same factory multiple times with different configurations. |
Beta Was this translation helpful? Give feedback.
-
I tried out to implement some extensible parsing. I created a pull request (#3047) so you can give feedback if the direction I'm heading is OK. I'll implemented only the Block Parsing for now. |
Beta Was this translation helpful? Give feedback.
-
I'm not happy with the quote depth or the proposed This could be a little unflexible.
Bothe would not work with an array of character's, since those would could change every line. Maybe having an Array of Delegates that will be called at the beginn of each line dynamicly checking the characters and returning the index where the sub text starts. Alternetivly extracting the quoted text (over all lines) and parsing it seperatly. That way sub quotes would just parse normaly. we would no longer need the quote depth. Drawback. for every quote we would allocate a new string. Or we would need a special structure that behaves like a string and can skip characters from an original string. |
Beta Was this translation helpful? Give feedback.
-
I'm currently considering the last, and will try to implement that. I will create a class called Later I would like to try to replace it with an Array of public char this[int index]
{
get
{
if (index < 0)
{
throw new IndexOutOfRangeException($"The provided {nameof(index)} ({index}) was < 0");
}
int position = 0;
for (int i = 0; i < spans.Length; i++)
{
if (index < position + spans[i].Length)
{
return spans[i].Span[position - index];
}
position += spans[i].Length;
}
throw new IndexOutOfRangeException($"The provided {nameof(index)} ({index}) was >= then the length ({position})");
}
} Also with most of the string methods. However, I'm not sure if this will actually be more performend then allocating a new string once and then just using the string methods. @michael-hawker Is it OK to add a reference to |
Beta Was this translation helpful? Give feedback.
-
Thanks for all the updates @LokiMidgard, I'll try and go through them with a bit of a closer eye to get caught up next week after the release. Will this change the For example, we use this in our Sample app to do some custom parsing/processing: https://github.com/windows-toolkit/WindowsCommunityToolkit/blob/master/Microsoft.Toolkit.Uwp.SampleApp/Controls/SampleAppMarkdownRenderer.cs |
Beta Was this translation helpful? Give feedback.
-
I'll have added two virtual methods to the renderer, these will be called if an unknown element is encountered. But I could revert this. I'm not happy with this solution. They are virtual so they break no one. |
Beta Was this translation helpful? Give feedback.
-
Finally had time to finish most of the stuff to get this finished. I have two things that need to be done
@michael-hawker can you point me in the right directions and what should be changed and what should stay. FirstHow can I test against previous windwos version SDKs? I only have one machine to test on? Seccond breaking chagesI'm not sure if I have breaking changes (I think I do). Some of the unit tests did failed before the refactoring. Now these "bug's" are fixed and text that previous resulted in one AST now results in another. There are multiple tests that would fail without the breaking change. Can I change/remove those tests? Or are the tests how it was intended and we want those breaking changes? I personally would like to keep the old behavior for those. A third option would be supporting both, After all we can now configure the Parser.^^ QuotesFailing Tests if behavior does not change
The problem is that blank line without a quote character in the middle of a quote should not split the quote.
The tests says this should parsed in ONE QuoteBlock containing 3 ParagraphBlocks and an additional ParagraphBlock after the QuoteBlock. The old implementation parsed a sequence of 3 QuoteBlocks containing one ParagraphBlock followed by a ParagraphBlock. The sample of MarkdownTextBlock suggests that the old behavior is correct:
MarkdownLinksTwo tests make problems here:
Neither I think every link that does not start with a schema should just pass. after all there could be a file named Space in DomainThe Uri class can't handle spaces in domain's. And I think they are not allowed, but I didn't find the correct specification. However the old implementation could handle an domain with space and replaced it with %20 (which Uri will also not accept). Currently two tests fail because the parser does not support spaces in domain. In this case I wouldn't mind this breaking change. Using the Uri class to check if an absolute URL is correct and what parts it has seems better then manually parsing the string. The later is just another surface for bugs. And if spaces are actually invalid for domains there should be no existing document where this would make problems. Of course if we would just allow anything to be parsed as an URL we wouldn't need to check using the Uri class. |
Beta Was this translation helpful? Give feedback.
-
@LokiMidgard See PR #1974 about the Quote Formatting |
Beta Was this translation helpful? Give feedback.
-
Thank you @WilliamABradley. That helped a lot. I updated the Quote tests. There is now one difference compared to the previous behavior.
Parses currently:
Old behavior was:
The current behavior is like CommonMark does it. And I think it make sense that putting quote characters before some text does not change the AST of the now quoted text. The Quote parser still has the code to switch the behavior, I just need to implement something that it knows it is currently inside another QuoteBlock so it will parse differently. If we don't need that I could delete a bunch of code that makes QuoteBlock a little more complex. |
Beta Was this translation helpful? Give feedback.
-
@michael-hawker Any Update? Can I remove the "draft" from the pull request? |
Beta Was this translation helpful? Give feedback.
-
I created a package in my github nuget package source and started using this. I hope I find some design flaws I've created when I refactored the code. |
Beta Was this translation helpful? Give feedback.
-
I head a new Idea how to prevent string operations that will allocate new strings. The implementation up to now uses the markdown string and also has for every method call index positions for start and end (and some more). This works well for most things. But not for the Quoteblock. That will allocate a new string to parse its contents. The original implementation also head a quote position to skip the first I like the A readonly ref struct that will expose lines of I started with a prototype of such a struct and will try to get it to work with the parsers. I hope this will not take to long to get it working again. Maybe some days (or a few weeks). |
Beta Was this translation helpful? Give feedback.
-
By the way, unlike |
Beta Was this translation helpful? Give feedback.
-
I made some performance test, comparing the original version to the current in the branch. BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.200
[Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Original : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
current : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Runtime=.NET Core 3.1
When I started, I was 18 times slower then the original. Now it's "only" around 3 times slower. My goal would be 1.0 but I'm not sure if I'll get much faster then now. The text I use is the sample text from the SampleApp. |
Beta Was this translation helpful? Give feedback.
-
Hey @LokiMidgard, sorry for the lack of response lately, there's been a lot going on since the holidays. I know you've just put a ton of work into these updates; however, we were just discussing recently, because of all the fiddliness and performance implications of Markdown, if it made sense to leverage a more well-tested implementation of a Markdown parser compared to our own custom one. Then just connect it to the renderer we have in the Toolkit for the MarkdownTextBlock (see Issue #3200). I'd love to hear your input on this new proposal. @MihaZupan from that project has also been helping us evaluate how our Toolkit scenarios can translate. I'm sure he understands the efforts you've gone through above to make sense of the Markdown specs. |
Beta Was this translation helpful? Give feedback.
-
@michael-hawker No problem. I can totally understand that decision. And I think it makes sense. When I made the performance tests and couldn't get near the original performance, I already questioned if it is a good idea to integrate this. Having a drastically performance drop to add a feature that probably many that use this library would not need, don't seems to be the correct choice. I will probably fork the current Parser, since what I need is a Markdown Parser that I can customize for my own Markdown Flavor. After all the reason I've done this wasn't altruism, but I needed such a feature ;) Your Project helped me a lot in getting where I am now with this, so thank you all for that. |
Beta Was this translation helpful? Give feedback.
-
Thanks @LokiMidgard for your understanding. Appreciate your input on the other thread as well, and I'm glad you've still found use from the work you've done. With everything in mind though, I think it makes sense to just close your PR #3047 for now until we know the full path forward? Do you have any problems with that? Thanks again! |
Beta Was this translation helpful? Give feedback.
-
No I closed it. |
Beta Was this translation helpful? Give feedback.
-
Transferring this to the discussion as we are not certain about this and requires more clarity. |
Beta Was this translation helpful? Give feedback.
-
I'm submitting a...
Description
Injection
I plan to work on this after #1650 is merged.
We are never going to cover everyone's Markdown Standards, so we should create a way to let the User define their own parsing Rules, Block, Inlines, Conditions, etc.
The current way I'm thinking of implementing this, is by having an event called ParsingBeginning, or something similar, in the Args for that event would be the Registered TripChars and ways of Identifying Blocks.
This way you can Add, Insert, and Remove registration for Block and Inline Creation. If, for some reason you want to disable List Block parsing, it could be done with this event.
You could then create your own Inlines to handle with your own custom logic and UI.
Rendering
To implement UI for this, new methods would be added to the MarkdownRendererBase.
Beta Was this translation helpful? Give feedback.
All reactions