Filter or remove rules to filter/remove by regexp/wildcard #423

@Flashwalker

Could we have filter or remove rules that match via regexp or wildcard?

E.g.:

1.

Zero-width space and/or non-breaking space:
<a href="https://bla-bla-bla">&ZeroWidthSpace;&ZeroWidthSpace;</a>text-text-text produces:

[​​](https://bla-bla-bla)text-text-text

Is there any way to filter out (remove) HTML elements with no visible content?
Something like:

turndownService.addRule('al_spaces', {
    regexFilter: '<[^<>]+?>[[:space:]]<\/[^<>]+?>',
    replacement: function (content) {
        return ''
    }
})

List of spaces for reference:

Code point Character name
\u0020 space
\u00A0 no-break space
\u1680 Ogham space mark
\u180E Mongolian vowel separator
\u2000 en quad
\u2001 em quad
\u2002 en space (nut)
\u2003 em space (mutton)
\u2004 three-per-em space (thick space)
\u2005 four-per-em space (mid space)
\u2006 six-per-em space
\u2007 figure space
\u2008 punctuation space
\u2009 thin space
\u200A hair space
\u200B zero width space
\u202F narrow no-break space
\u205F medium mathematical space
\u3000 ideographic space
\uFEFF zero width no-break space
\uFFFC object replacement character
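
Something close to this may already be possible with the existing `addRule` API, because a rule's `filter` can be a function that receives the node. Below is a minimal sketch, assuming only anchors should be dropped and that the list above is the full set of characters to treat as invisible. The names `isVisuallyEmpty` and `emptyAnchorRule` are made up for illustration and are not part of Turndown. The characters are spelled out explicitly because JavaScript's `\s` does not match U+200B.

```javascript
// Characters from the list above, written as one regex character class.
// The range \u2000-\u200B covers en quad through zero width space.
const INVISIBLE_ONLY =
  /^[\u0020\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF\uFFFC]*$/;

// True when the text is empty or consists only of the characters listed above.
function isVisuallyEmpty(text) {
  return INVISIBLE_ONLY.test(text);
}

// Hypothetical rule (the name and the anchor-only scope are assumptions):
// drop <a> elements whose children are all text nodes with no visible text.
const emptyAnchorRule = {
  filter: function (node) {
    return (
      node.nodeName === 'A' &&
      // nodeType 3 is a text node; this keeps anchors that wrap an <img>.
      Array.from(node.childNodes).every((child) => child.nodeType === 3) &&
      isVisuallyEmpty(node.textContent)
    );
  },
  replacement: function () {
    return '';
  },
};

// With Turndown loaded, the rule would be registered like this:
// turndownService.addRule('emptyAnchor', emptyAnchorRule);
```

Restricting the filter to anchors with text-only children is a deliberate choice in this sketch, so that elements such as `<img>` (which have empty text content but are clearly visible) are not removed.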

2.

A line break that breaks the Markdown markup:
<strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text produces:

**bla-bla-bla
** 
text-text-text

Is there any way to filter out (remove) all line breaks that precede a closing tag?
Something like:

turndownService.removeAllBefore('<br>', '</*>')

Here are some regex examples:

Remove the anchor that contains only zero-width spaces (they are invisible until you paste the string into the dev console):

selectedHTML='<i>bla</i><b><a href="https://bla-bla-bla">​​​​​​​</a>text-text-text</b><i>bla</i>'
selectedHTML.replace(/<[^<>]+?>[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF\u0020\uFFFC]+<\/[^<>]+?>/gm, '')

Remove the line break that precedes a closing tag:

selectedHTML='<i>bla</i><strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text<i>bla</i>'
selectedHTML.replace(/(<br ?\/?>)+(<\/[^<>]+?>)/gi, '$2')

Swap the line break and the closing tag that follows it:

selectedHTML='<i>bla</i><strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text<i>bla</i>'
selectedHTML.replace(/((<br ?\/?>)+)(<\/[^<>]+?>)/gi, '$3$1')
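
Until something like this exists in the library, the swap can be applied as a pre-processing step on the HTML string before it is passed to Turndown. The following is a rough sketch under that assumption. `moveBreaksOutside` is a made-up helper name, and the regex is adapted from the one above with a non-capturing inner group. The loop repeats the replacement so that a break nested inside several closing tags ends up outside all of them.

```javascript
// Move any run of <br> tags that sits directly before a closing tag
// to just after that closing tag, repeating until nothing changes.
function moveBreaksOutside(html) {
  let previous;
  do {
    previous = html;
    html = html.replace(/((?:<br ?\/?>)+)(<\/[^<>]+?>)/gi, '$2$1');
  } while (html !== previous);
  return html;
}

const input = '<strong>bla-bla-bla<br></strong>&nbsp;<br>text-text-text';
const cleaned = moveBreaksOutside(input);
// cleaned === '<strong>bla-bla-bla</strong><br>&nbsp;<br>text-text-text'

// With Turndown loaded, the cleaned string would then be converted:
// const markdown = turndownService.turndown(cleaned);
```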

It would be nice if the regex filter skipped the content of code and pre tags.
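
One way to approximate that skipping with plain string processing is to split the HTML on `<pre>` and `<code>` blocks and run the replacement only on the pieces in between. This is only a sketch: `replaceOutsideCode` and `dropBreakBeforeClose` are hypothetical names, and regex-based splitting assumes well-formed markup without nested `<pre>` or nested `<code>` elements.

```javascript
// Apply `transform` only to the parts of `html` that lie outside
// <pre>...</pre> and <code>...</code> blocks.
function replaceOutsideCode(html, transform) {
  // The single capturing group makes split() keep the protected blocks
  // in the result, where they land at the odd indexes.
  const parts = html.split(
    /(<pre\b[^>]*>[\s\S]*?<\/pre>|<code\b[^>]*>[\s\S]*?<\/code>)/gi
  );
  return parts
    .map((part, index) => (index % 2 === 1 ? part : transform(part)))
    .join('');
}

// Example transform: remove line breaks that directly precede a closing tag.
const dropBreakBeforeClose = (s) =>
  s.replace(/(?:<br ?\/?>)+(<\/[^<>]+?>)/gi, '$1');

const mixed = '<b>a<br></b><pre>x<br></pre><code>y<br></code><i>z<br></i>';
const result = replaceOutsideCode(mixed, dropBreakBeforeClose);
// result === '<b>a</b><pre>x<br></pre><code>y<br></code><i>z</i>'
```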

P.S.
And also:

// Drop anchor tags that contain only dots and commas
selectedHTML = '<a href="#">,</a>'
selectedHTML.replace(/<a [^<>]+?>[.,]+<\/a>/gim, '')

And

// Drop emoji images, keep emoji unicode (from alt attr)
selectedHTML = '<img src="img-apple-64/1f914.png" class="emoji" alt="🤔">'
selectedHTML.replace(/<img [^<>]+?alt=['"]([\p{Emoji}\u200d]+)['"][^<>]*?\/?>/gimu, '$1')
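
The emoji case might also be expressible as an ordinary Turndown rule instead of a regex over the raw HTML. Here is a minimal sketch, assuming the images carry an `emoji` class and a non-empty `alt` attribute as in the example above. `emojiImageRule` is a made-up name, and the rule does not verify that the `alt` text really is an emoji.

```javascript
// Hypothetical rule: replace <img class="emoji" alt="..."> with its alt text.
const emojiImageRule = {
  filter: function (node) {
    if (node.nodeName !== 'IMG') return false;
    const classes = (node.getAttribute('class') || '').split(/\s+/);
    return classes.includes('emoji') && Boolean(node.getAttribute('alt'));
  },
  replacement: function (content, node) {
    return node.getAttribute('alt');
  },
};

// With Turndown loaded, the rule would be registered like this:
// turndownService.addRule('emojiImage', emojiImageRule);
```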
