Proposed API: MathML, SVG support, plus localname case-handling. #103

Open
otherdaniel opened this issue Jun 30, 2021 · 49 comments

@otherdaniel
Collaborator

Separate issue, because this tries to pull together several existing issues in one.

The current problems are:

  • SVG + MathML are unspecified. That's a bit of an embarrassment.
  • Some spec items (like lower-casing) interact with that in odd ways.
  • The config getters were originally specified as returning a copy of the config as it was passed in, while feedback suggested we should return a normalized copy that mirrors exactly what the API will actually handle.

This mainly picks up SecurityMB's comment in #72, and tries to extend it to cover everything.

  • In all places where we accept element names -- allow lists or block lists, or elements in attribute allow/block lists -- we will accept pseudo-namespaced element names "html:name", "svg:name", "math:name", or "*:name". The latter can also be written as "name". That is, any namespace is the default.
  • The prefixes are fixed strings, and will not attempt to process XML Namespaces-style namespace declarations, or anything similar.
  • The *: variant matches any element with the given localName, the html:, svg:, math: variants match against the localname, iff the parser has placed the elements into the corresponding namespace. (Since now all methods that (implicitly) parse have an explicit context, this ought to work with a rather low surprise factor.)
  • The html: and *: variants match case-insensitively. The config returns them lower-cased. The other two match case-sensitively and the config getter returns their names as-is.
  • The config getters will return all *: names without the prefix. (I.e., "*:script" becomes "script".)
  • The config getters will drop malformed strings. (Like: "wondernamespace:script".)
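
The normalization rules in the list above can be sketched as follows. This is only an illustration of the proposal as written, not spec text: normalizeElementName and normalizeElementList are hypothetical helper names, and treating an empty local name as malformed is an assumption.

```javascript
// Hypothetical helpers sketching the proposed normalization; not part of any spec.
const KNOWN_PREFIXES = new Set(["html", "svg", "math", "*"]);

function normalizeElementName(name) {
  const colon = name.indexOf(":");
  // No prefix means "any namespace", i.e. the same as "*:name".
  const prefix = colon === -1 ? "*" : name.slice(0, colon);
  const localName = colon === -1 ? name : name.slice(colon + 1);

  // Unknown prefixes are malformed and get dropped.
  // Assumption: an empty local name is treated as malformed too.
  if (!KNOWN_PREFIXES.has(prefix) || localName === "") return null;

  // "html:" matches case-insensitively, so the name is lower-cased.
  if (prefix === "html") return "html:" + localName.toLowerCase();
  // "*:" is also case-insensitive and is returned without its prefix.
  if (prefix === "*") return localName.toLowerCase();
  // "svg:" and "math:" are case-sensitive and returned as-is.
  return prefix + ":" + localName;
}

function normalizeElementList(names) {
  return names.map(normalizeElementName).filter((n) => n !== null);
}
```

With these rules, normalizeElementList(["*:SCRIPT", "svg:foreignObject", "wondernamespace:script"]) yields ["script", "svg:foreignObject"].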

WDYT?


I find the case-handling based on the namespace to be slightly surprising, but this matches what the HTML parser / DOM does. If we embrace that, I think the rest of the API is fairly straightforward.

@securityMB

  • In all places where we accept element names -- allow lists or block lists, or elements in attribute allow/block lists -- we will accept pseudo-namespaced element names "html:name", "svg:name", "math:name", or "*:name". The latter can also be written as "name". That is, any namespace is the default.

After thinking about it for a little while, I see some problems with using : as a namespace separator:

  1. Even though nobody probably does that, : can be used in a tag name. For instance <abc:div> will create an element called abc:div in the DOM tree.
  2. The fact that a single string contains both namespace and tag name makes it a little less convenient to type, for instance in TypeScript. If we used another syntax, for instance: ['html', 'name'] or {tagName:'name', namespace:'html'} (but this one is probably too verbose?), then IDEs could give suggestions to developers really easily.

Furthermore, I am not sure that any namespace should be the default. The last few bypasses of DOMPurify stemmed from the fact that some elements were created in an unexpected namespace (like form in the MathML namespace).

@otherdaniel
Collaborator Author

Thanks, this is good feedback!

  • Colons in names: I don't think we should re-invent (or re-implement) "full" namespaces. I think we should pick a handful of prefixes with special meaning. Then, if all 3 namespaces browsers actually process have a prefix, one can specify even weird names with a colon in them. Doing so is awkward, but as that is probably a rather exceptional use case that'd be okay.

  • TypeScript / IDEs: I'm not sure I get this. Is the issue that fusing the namespace into the string makes it opaque to type systems and IDEs? What's a case where that'd make a difference? (Off-hand, it seems that a structure with a namespace designator and a separate local name would be pretty awkward to use, for little benefit in the majority case.)

  • 'any' as default: This is an excellent point. For a block-list my proposal makes sense, but for an allow-list... not so much. I wonder if HTML could be the default instead - given that that's probably the 99% use case - or if this means there is no meaningful default.

@Andrew-Cottrell

Andrew-Cottrell commented Jul 2, 2021

I think "html:" as the default would be least surprising for a majority of developers; and also the safer option. It may not be necessary to have "html:" as an explicit option; the Sanitizer API could only have no prefix, "svg:", "math:", and "*:". I'm not sure "*:" is absolutely necessary either and might eventually become a lint target if overused (like specifying "*" for targetOrigin in the postMessage API).

It might be useful to be able to block "svg:*" and "math:*".

@securityMB

  • Colons in names: I don't think we should re-invent (or re-implement) "full" namespaces. I think we should pick a handful of prefixes with special meaning. Then, if all 3 namespaces browsers actually process have a prefix, one can specify even weird names with a colon in them. Doing so is awkward, but as that is probably a rather exceptional use case that'd be okay.

Agreed on "full" namespaces. My argument was just about the separator that we use. I saw some tests in web platform that used space as a separator (for instance: svg animate means animate in SVG namespace) which seems fine, because you cannot have a space in tag name.

  • TypeScript / IDEs: I'm not sure I get this. Is the issue that fusing the namespace into the string makes it opaque to type systems and IDEs? What's a case where that'd make a difference? (Off-hand, it seems that a structure with a namespace designator and a separate local name would be pretty awkward to use, for little benefit in the majority case.)

Yeah, I'm not entirely sure about the importance of this one. My argument was that if you make a typo in the namespace, for instance: "svf:animate", then you'll see the mistake only at runtime. If you split the namespace and tag name (for instance: ["svf", "animate"]), then the IDE can spot the mistake right away because the namespace part can be typed as 'svg' | 'html' | 'math'.
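
To make the tuple idea concrete, here is a minimal sketch in plain JavaScript with JSDoc annotations. The names PseudoNamespace and isValidEntry are hypothetical. An editor that type-checks JSDoc could flag "svf" while typing; the function itself only catches the typo at runtime.

```javascript
// Hypothetical tuple form for a config entry: [namespace, localName].
const NAMESPACES = Object.freeze(["html", "svg", "math"]);

/** @typedef {"html" | "svg" | "math"} PseudoNamespace */

/**
 * Returns true if the tuple names one of the fixed namespaces
 * and has a non-empty local name.
 * @param {[PseudoNamespace, string]} entry
 * @returns {boolean}
 */
function isValidEntry([namespace, localName]) {
  return (
    NAMESPACES.includes(namespace) &&
    typeof localName === "string" &&
    localName.length > 0
  );
}
```

Here isValidEntry(["svg", "animate"]) is true, while isValidEntry(["svf", "animate"]) is false.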

'any' as default: This is an excellent point. For a block-list my proposal makes sense, but for an allow-list... not so much. I wonder if HTML could be the default instead - given that that's probably the 99% use case - or if this means there is no meaningful default.

My proposal is to:

  • If the element name is well-known and has specific namespace(s), then apply its namespace(s). For instance, if someone says "form", then it is only correct in HTML namespace, so this is the default. "mglyph" is only in MathML namespace so it is also the default. "title" is in HTML and SVG so this is also the default.
  • If the element name is custom, then we assume HTML namespace by default. This seems sane, as I believe you cannot create custom elements in other namespaces (I might be wrong though).
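
A rough sketch of this lookup, using only the three names mentioned above. WELL_KNOWN and defaultNamespacesFor are hypothetical names, and the table is deliberately tiny; a real one would have to be derived from the HTML, SVG and MathML specifications.

```javascript
// Illustrative table only: the entries are the examples from the comment above.
const WELL_KNOWN = new Map([
  ["form", ["html"]],
  ["mglyph", ["math"]],
  ["title", ["html", "svg"]],
]);

// Well-known names get their own namespace(s); anything else
// (e.g. custom elements) is assumed to be in the HTML namespace.
function defaultNamespacesFor(name) {
  return WELL_KNOWN.get(name) ?? ["html"];
}
```

So "title" would apply to both HTML and SVG, while a custom name such as "my-widget" would default to HTML only.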

@otherdaniel
Collaborator Author

otherdaniel commented Oct 8, 2021

Looks like we've let this linger for a while... let's have a new go at it.

After reading the feedback here and there, and discussing this with @mozfreddyb, I'd like to - for now - prefer simplicity over expressiveness, so that we can start with a simple proposal and extend it as concrete use cases emerge.

The cornerstones are:

  • pseudo-namespaces. As above: There's a fixed set of namespaces supported by the HTML spec. We'll just assign fixed identifiers to them, rather than attempting to support arbitrary namespaces or XML namespaces-style namespace declarations.
  • No wildcards: This is a result of simple-first. We do run the risk of making some use cases rather cumbersome, but what we gain is Sanitizer configurations that should be straightforward to read (and implement).
  • No cleverness in mapping names. This is another simple-first thing. It might make configs harder to write, but it'll hopefully also make them easier to read.

================================

This would be the proposal:

  • Support a set of fixed namespace designators: "html", "svg", "math" for elements, "xml", "xmlns", "xlink" for attributes.

  • No namespace designator defaults to "html" for elements, and to none for attributes.

    • E.g., "p" is "html:p", not "svg:p". "svg:p" is the only way to reference SVG's paragraph element.
    • [Edit:] Alternative: Drop the "html" prefix, or the default, so that there's a unique way to designate HTML elements.
  • The namespace separator is the whitespace character.

    • Alternative: Use the colon character (":").
    • I've been agonizing over this. I think literally everyone else uses colon, so space looks just weird to me.
      But I think @securityMB is right here. With space separators we can represent all valid HTML names. (See How to identify namespaces / SVG + MathML #72 (comment), which reminds us that "xml:lang" is a valid, specified, non-namespaced attribute name.)
      It also reminds authors that we're doing not-quite-namespaces here, rather than e.g. support of full XML namespace spec.
  • Element/attribute names with a namespace separator but no valid namespace designator in front of it are an error and are dropped.

    • Alternative: We could also throw an exception.
  • All config items remain the same, except they will now parse out the namespace separator; convert element/attributes to their respective namespace. Matching against a config item becomes namespace aware.

    • So e.g. there's a single allowElements list, which would contain a mix of HTML + SVG elements.
  • These rules form a 1:1 relationship between config strings and namespace/elements, except for HTML elements, which have a 2:1 relationship. The getConfiguration getters normalize HTML strings to their non-prefixed form.

    • Alternative: I'm confident we should have a prescribed normalization, but I'm unsure which way it should go. I'm fine with either prefixed/non-prefixed.
  • Element/attribute local name normalization will reference whatever the HTML spec does.

  • For convenience, the config gets an allowXXX boolean setting for each namespace, to allow users to turn off e.g. all of SVG without having to re-write their config entirely.

    • Alternative: We might not do this at all, since the naming rules make the config easily filter-able.
    • Alternative: Instead of several boolean-valued config items we could also have one config item which takes a string set and re-use the namespace designators.
    • I'm not sure what the default should be, but I lean towards allowing only HTML elements + non-namespaced attributes by default.
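
The parsing side of this proposal could look roughly like the sketch below, assuming the whitespace separator and the "drop on error" behaviour. parseName and serializeElement are hypothetical names, local-name case normalization is left out, and the non-prefixed normalization of HTML names is just one of the two options discussed.

```javascript
// Fixed designators from the proposal above.
const ELEMENT_NAMESPACES = new Set(["html", "svg", "math"]);
const ATTRIBUTE_NAMESPACES = new Set(["xml", "xmlns", "xlink"]);

// kind is "element" or "attribute". Returns null for entries to be dropped.
function parseName(entry, kind) {
  const designators = kind === "element" ? ELEMENT_NAMESPACES : ATTRIBUTE_NAMESPACES;
  // No designator: "html" for elements, no namespace for attributes.
  const defaultNamespace = kind === "element" ? "html" : null;

  const space = entry.indexOf(" ");
  if (space === -1) return { namespace: defaultNamespace, localName: entry };

  const namespace = entry.slice(0, space);
  const localName = entry.slice(space + 1);
  // A separator without a valid designator in front of it is an error.
  if (!designators.has(namespace) || localName === "") return null;
  return { namespace, localName };
}

// Normalizes HTML elements to their non-prefixed form (one of the two options).
function serializeElement({ namespace, localName }) {
  return namespace === "html" ? localName : `${namespace} ${localName}`;
}
```

With a space as separator, the attribute name "xml:lang" passes through untouched as a non-namespaced name, while "xlink href" is parsed as href in the xlink namespace.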

@Andrew-Cottrell

Andrew-Cottrell commented Oct 9, 2021

  • No namespace designator defaults to "html" for elements, and to none for attributes.
    • Alternative: Drop the "html" prefix, or the default, so that there's a unique way to designate HTML elements.
  • These rules form a 1:1 relationship between config strings and namespace/elements, except for HTML elements, which have a 2:1 relationship. The getConfiguration getters normalize HTML strings to their non-prefixed form.
    • Alternative: I'm confident we should have a prescribed normalization, but I'm unsure which way it should go. I'm fine with either prefixed/non-prefixed.

To me, it seems simpler to drop the "html" prefix, which would ensure a 1:1 relationship in all cases and resolve getConfiguration normalization. But I think the strongest reason to drop the "html" prefix is that would reflect how people currently read & write HTML. If the "html" prefix is retained but not required, I expect most authors would not use it.

It would be good to see specified use cases or other reasons for retaining the "html" prefix, which may suggest answers to the indicated alternatives quoted above.

@Andrew-Cottrell

Andrew-Cottrell commented Oct 9, 2021

  • Element/attribute names with a namespace separator but no valid namespace designator in front of it are an error and are dropped.
    • Alternative: We could also throw an exception.

Given the indeterminate lifetime of HTML a new namespace designator may eventually be needed, so it may be more backwards compatible (e.g. newer code in an older browser) to drop rather than throw. I expect static analysis tools or runtime validation libraries will be developed if invalid configuration becomes a problem such that people want to verify. However, there might be security reasons to throw rather than drop.

@mozfreddyb
Collaborator

That's great, thanks for writing that up.
I mostly agree, except this one thing:

  • For convenience, the config gets an allowXXX boolean setting for each namespace, to allow users to turn off e.g. all of SVG without having to re-write their config entirely.
    • Alternative: We might not do this at all, since the naming rules make the config easily filter-able.
    • Alternative: Instead of several boolean-valued config items we could also have one config item which takes a string set and re-use the namespace designators.
    • I'm not sure what the default should be, but I lean towards allowing only HTML elements + non-namespaced attributes by default.

I don't like the idea of adding lots of boolean settings. Instead, I suggest we provide static constants on the sanitizer that allow building and combining lists. We can bikeshed on the names, I don't have strong feelings, but something along the lines of Sanitizer.ALLOWED_HTML_ELEMENTS, Sanitizer.ALLOWED_SVG_ELEMENTS would help implementing your use cases in an (imho) clearer way.
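
As a sketch of how such constants might be combined: the two arrays below are hypothetical stand-ins for Sanitizer.ALLOWED_HTML_ELEMENTS and Sanitizer.ALLOWED_SVG_ELEMENTS, their contents are made up for illustration, and the SVG names assume the whitespace separator from the proposal.

```javascript
// Hypothetical stand-ins for the suggested static constants.
// The real lists would be defined by the spec, not by page authors.
const ALLOWED_HTML_ELEMENTS = Object.freeze(["p", "em", "a"]);
const ALLOWED_SVG_ELEMENTS = Object.freeze(["svg circle", "svg path"]);

// HTML only.
const htmlOnly = { allowElements: [...ALLOWED_HTML_ELEMENTS] };

// HTML plus SVG, built by concatenating the two lists.
const htmlAndSvg = {
  allowElements: [...ALLOWED_HTML_ELEMENTS, ...ALLOWED_SVG_ELEMENTS],
};

// HTML without links, built by filtering a preset.
const noLinks = {
  allowElements: ALLOWED_HTML_ELEMENTS.filter((name) => name !== "a"),
};
```

Ordinary array operations then cover what the boolean switches were meant to do, without adding new config keys.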

@otherdaniel
Collaborator Author

To me, it seems simpler to drop the "html" prefix, which would ensure a 1:1 relationship in all cases and resolve getConfiguration normalization. But I think the strongest reason to drop the "html" prefix is that would reflect how people currently read & write HTML. If the "html" prefix is retained but not required, I expect most authors would not use it.

It would be good to see specified use cases or other reasons for retaining the "html" prefix, which may suggest answers to the indicated alternatives quoted above.

My intuition was that if everything else has a prefix then so should HTML, but also that requiring an HTML prefix is an awful lot of extra typing. Admittedly, that's a rather weak argument, and dropping the html prefix would certainly make things simpler.

Given the indeterminate lifetime of HTML a new namespace designator may eventually be needed, so it may be more backwards compatible (e.g. newer code in an older browser) to drop rather than throw. I expect static analysis tools or runtime validation libraries will be developed if invalid configuration becomes a problem such that people want to verify. However, there might be security reasons to throw rather than drop.

This is very true. We should drop (rather than throw). Especially since .getConfiguration() allows developers to check what the browser actually made out of their config.
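
A small sketch of such a check, written as a pure function over two lists so that it does not depend on the final API shape. findDroppedEntries is a hypothetical helper, and it assumes the requested entries are already written in normalized form (otherwise a re-cased name would show up as dropped).

```javascript
// Returns the requested entries that are missing from the effective config.
// Assumption: both lists use the same (normalized) spelling of names.
function findDroppedEntries(requested, effective) {
  const kept = new Set(effective);
  return requested.filter((name) => !kept.has(name));
}
```

A page could pass its own allowElements list as requested and the list read back from .getConfiguration() as effective, assuming the returned object exposes the same allowElements array discussed in this thread.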

I don't like the idea of adding lots of boolean settings. Instead, I suggest we provide static constants on the sanitizer that allow building and combining lists. We can bikeshed on the names, I don't have strong feelings, but something along the lines of Sanitizer.ALLOWED_HTML_ELEMENTS, Sanitizer.ALLOWED_SVG_ELEMENTS would help implementing your use cases in an (imho) clearer way.

Yes, that'd also work. I wonder how many 'real' use cases there are for this. I'm guessing HTML-only and everything are common, which can be easily covered by presets.

@otherdaniel
Collaborator Author

I'm currently looking at how to spec this. I initially thought I'd go with whitespace as separator, as discussed above, but https://html.spec.whatwg.org/multipage/syntax.html#attributes-2 already specifies a character-by-character representation of all namespaced attributes allowed in HTML, in the table at the end of the subsection. And that uses a colon. I suspect that re-inventing our own representation is not a good idea then.

In either case, I'll update the issue when I have a review-quality PR ready.

@otherdaniel
Collaborator Author

Hello again, I've now uploaded PR #137, which drafts support for SVG and MathML.

I'm not super happy with the result, so it'd be fantastic if people could offer some opinions on whether this is going in the right direction, and how to improve it.

In particular:

  • I've tried hard to base this on existing definitions and precedent in the HTML spec. The result is awkward, since HTML defines colon-based names for attributes (e.g. xlink:href), but allows for colons inside regular element names. In the PR, Sanitizer does the same: Attributes use colon as namespace designator; elements use whitespace. Pretty awkward, IMHO.
  • I've rewritten the "effective" config stuff, since it got in the way of specifying the namespaces. Not sure if better or worse.

@ju1ius

ju1ius commented Dec 14, 2021

Hi @otherdaniel,

I've just read the spec draft and first let me thank you for your work because I think this is something much needed for the platform!

Now WRT namespaces, there is an existing syntax that could be reused here: CSS type selectors.

The advantages of using this syntax would be:

  • The syntax already exists, so no need to reinvent the wheel, and the matching behavior is precisely defined, including case sensitivity.
  • Even though usage of the namespace feature in CSS selectors might not be very widespread, authors should already be aware of it.
  • Less friction with the already existing namespace prefix syntax: even if the HTML parser will blindly accept <foo|bar>baz</foo|bar> as a valid element, document.createElement('foo|bar') will not work.
  • Would allow the API to work on non-HTML documents, provided there's a way to pass the algorithms a map from prefixes to namespaces.

The following example would block bar elements in the urn:foo namespace, baz elements in any namespace, qux elements without a namespace and gizmo elements in the default namespace, in this case that of the document element:

const sanitizer = new Sanitizer({
  defaultNamespace: document.documentElement.namespaceURI,
  namespaces: {
    foo: 'urn:foo',
  },
  blockElements: ['foo|bar', '*|baz', '|qux', 'gizmo'],
})
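
How the four entries in blockElements would resolve can be sketched as below. This follows the CSS type selector rules for namespace prefixes as I understand them; parseTypeSelector is a hypothetical helper, it ignores CSS escapes, and representing "any namespace" as "*" and "no namespace" as null is a choice made for this sketch only.

```javascript
// Resolves a simple "prefix|name" type selector against a prefix map.
// Returns null for an undeclared prefix (the selector would be invalid).
function parseTypeSelector(selector, { namespaces = {}, defaultNamespace } = {}) {
  const bar = selector.indexOf("|");
  if (bar === -1) {
    // No prefix: the default namespace if declared, otherwise any namespace.
    const namespace = defaultNamespace === undefined ? "*" : defaultNamespace;
    return { namespace, localName: selector };
  }
  const prefix = selector.slice(0, bar);
  const localName = selector.slice(bar + 1);
  if (prefix === "*") return { namespace: "*", localName }; // any namespace
  if (prefix === "") return { namespace: null, localName }; // no namespace
  if (!Object.prototype.hasOwnProperty.call(namespaces, prefix)) return null;
  return { namespace: namespaces[prefix], localName };
}
```

Given namespaces {foo: 'urn:foo'}, 'foo|bar' resolves to the urn:foo namespace, '*|baz' to any namespace, '|qux' to no namespace, and 'gizmo' to whatever defaultNamespace was passed.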

If the syntax were to be defined in terms of CSS type selectors, this would allow to gradually introduce support for other CSS selectors into the API, so that things like this would become possible:

const sanitizer = new Sanitizer({
  dropElements: [
    'a[target="_blank"]',
    'iframe:not([src^="https://www.youtube.com"])',
  ],
})

What do you think ?

@Andrew-Cottrell

Andrew-Cottrell commented Dec 14, 2021

I think using a subset of the CSS selector syntax is a really interesting idea. My main concern with supporting a large subset would be the serializing rules. But many people are using CSS selectors with DOM APIs and things generally seem to work well (although greater use of CSS.escape would probably help). In any case, I would be happy using | as the pseudo-namespace delimiter.

@ju1ius

ju1ius commented Dec 14, 2021

@Andrew-Cottrell I don't get why CSS serialization rules would be an issue here. Could you elaborate on this?

@Andrew-Cottrell

Andrew-Cottrell commented Dec 14, 2021

I don't get why CSS serialization rules would be an issue here. Could you elaborate on this?

const badSanitizer = new Sanitizer({
  dropElements: [
    'div.123abc' // incorrectly serialized, but an easy mistake to make
  ]
});

const goodSanitizer = new Sanitizer({
  dropElements: [
    'div.\\31 23abc' // correctly serialized, could also use: 'div.' + CSS.escape('123abc')
  ]
});

This probably isn't a serious problem in practice, but I'm slightly concerned with the ease of this mistake in a security context.

@ju1ius

ju1ius commented Dec 14, 2021

Ah I see, indeed the spec would need to define what to do in case of an invalid selector. Whether to throw a SyntaxError DOM exception, silently ignore it or whatever would make the most sense in a security-sensitive context.

Currently, the spec just says to remove element names that were normalized to null from the allow lists, so the same should probably be done in case of an invalid selector...

@mozfreddyb
Collaborator

triage: Let's keep this one open to ensure we have alignment on a v1 list of allowed elements (html, svg, mathml)

@mozfreddyb
Collaborator

Should be moot with #208 landed.

@annevk
Collaborator

annevk commented Mar 20, 2024

Do we have an SVG and MathML safelist? Is that tracked anywhere else? That's the main thing I can still see missing.

@bkardell

Do we have an SVG and MathML safelist? Is that tracked anywhere else? That's the main thing I can still see missing.

It seems that MathML isn't even mentioned in the spec currently? SVG is here, but that mention only points to the SVG namespace (which is right below the MathML namespace in that doc :)). Is there a reason it wasn't included?

@benbucksch

benbucksch commented Nov 27, 2024

There's the extremes "known to be harmful" and "known and extensively security tested (by browser vendors and external testers) to be safe and harmless". But there's a large area in the middle, including complex features that have not been extensively tested for security yet. To give an ill-fitting example: WebGL is a complex feature that I would definitely not allow through a sanitizer. Even though it's not "known to be bad", like direct RAM access by the GFX card. Likewise, if a MathML feature cannot be proven to be practically impossible to cause security holes (including buffer overflows within the render engine), then it should be disabled in a sanitizer. If MathML hasn't received extensive security testing yet, it's better to not allow it by default.

@mozfreddyb
Collaborator

While I agree that some capabilities are risky to expose to the web (like e.g., graphics APIs), I don't think it is the sanitizer's job to control or get in the way of non-declarative APIs. WebGL is not a markup feature. This is about elements and attributes - features of a document that may contain user-supplied content.

The web is beautiful because it is powerful. I don't think we or anyone else should play feature-police of what is considered risky and what is not. Specifically, because you can't prove a negative but also generally because I don't think we should care about implementation bugs. If a browser has a bug, then it should fix it 😉

@benbucksch

benbucksch commented Nov 27, 2024

For security-sensitive use cases, the whole purpose of the sanitizer is to allow only what is proven to be secure ("secure" meaning extremely unlikely to have a security hole in the next years). I understand that's different in approach from only "remove what is known to be harmful". I am making that difference explicit, so that both needs can be met, possibly with different profiles or options in the config. Security-conscious use cases need a profile where unproven features are not enabled. The sanitizer is a "feature police" by its very nature.

(And FWIW, any remote code - like JavaScript or WebGL, sandboxed or sanitized or not - definitely has absolutely no business in sanitized HTML code.)

@benbucksch

benbucksch commented Nov 27, 2024

If a browser has a bug, then it should fix it 😉

That's the thing: When I created the sanitizer in Firefox ca. 25 years ago, the whole purpose of sanitizing HTML was to protect against browser bugs ;-) . There are dozens of remote code execution security holes in every browser, every month. That just isn't good enough for some situations. I.e. there are a factor of 100 to 1000 too many of those browser bugs for many use cases. The sanitizer is the way to deal with that, without resorting to plaintext.

To get back to topic here: There should be

  1. a super-conservative profile that turns off MathML entirely, and
  2. a conservative profile that enables only those parts of MathML that have seen extensive testing and are so simple by their nature that they cannot conceivably create security holes, and
  3. a permissive profile that allows most of MathML, other than the parts that are known to be problematic. (Which is what you are talking about.)

@annevk
Collaborator

annevk commented Nov 27, 2024

@polx thanks! Here's the thinking from the group on this:

  1. It would be great to have a list of MathML elements matching the goals set out in Safe sanitizer default #228.
  2. Our thinking is that we integrate this list here (and eventually the HTML standard) for the time being. Perhaps at some point it can be maintained in a more decentralized manner, but given it's security-sensitive we'd like to keep it centralized for now.

@benbucksch accounting for browser bugs is not realistic. There might well be bugs in the sanitizer, a browser could start executing the contents of a span element as script, etc. We're also not going to do profiles in v1. We'll have a default though along the lines set out in #228. And we'll allow the configuration to be modified so other needs can be met.

@benbucksch

benbucksch commented Nov 27, 2024

@benbucksch accounting for browser bugs is not realistic

By removing JavaScript, I kill 90% of browser bugs, and when I disable <video>, I kill 5% more in video decoders, so it's proven to be possible to avoid browser bugs by using a simple sanitizer. The same is true for other features which are untested and/or prone to be buggy. I don't know whether MathML has received a lot of security scrutiny.

@polx

polx commented Nov 28, 2024

We're working on it.

While writing, I noticed that one element allows to "hide content" in it (this is for layout purposes and allows subtle layouts to be built). Should we consider this feature as a "bad feature" (like a white character on a white background) that the sanitizer should aim at removing?

@annevk
Collaborator

annevk commented Nov 29, 2024

@polx given that we want to address styling-based attacks, I'd think so. You could keep it in a separate list for closer review down the line perhaps?

@Sora2455

As a prospective user, I'm planning to use a strict whitelist, and I imagine all security-conscious engineers will do the same. So it doesn't matter to me if these potentially-risky elements are excluded by default or not, I'm excluding them myself.

@polx

polx commented Jan 14, 2025

Here is a proposed version. We should discuss and decide on this list on the next MathML-core meeting at the end of January 2025. Comments are very welcome.


MathML Safe List

Short Version

MathML-core considers all elements and attributes of MathML-core (as listed in section 2.1 of MathML-core) as safe and not needing sanitization, except for the following.

We recommend that the Sanitizer API sanitize MathML by keeping all elements and attributes except the following:

  • any attribute shared with HTML that needs sanitization,
  • the maction and mphantom elements (these elements can be replaced by their first child), and
  • any annotation or annotation-xml element whose encoding attribute is absent or names a media type that is not among the trusted types, or which contains an href attribute.

Detailed Version

MathML-core considers the following elements and attributes of MathML-core as safe and not needing sanitization:

Safe "as-is" Elements of MathML-core:
math, merror, mfrac, mi, mmultiscripts, mn, mo, mover, mpadded, mprescripts, mroot, mrow, ms, mspace, msqrt, mstyle, msub, msubsup, msup, mtable, mtd, mtext, mtr, munder, munderover, semantics

Attributes of MathML-core:
dir, displaystyle, mathbackground, mathcolor, mathsize, scriptlevel, encoding, display, linethickness, intent and arg; on mo elements: form, fence, separator, lspace, rspace, stretchy, symmetric, maxsize, minsize, largeop, movablelimits; on mpadded elements: width, height, depth, lspace, voffset; on mspace elements: width, height, depth; on munderover elements: accent and accentunder; on mtd elements: columnspan and rowspan.

Moreover, the following attributes have their syntax and semantics specified in the HTML specification. The sanitizer behaviour on these attributes should be the same as on HTML elements: on*, id, class, style, data-*, autofocus, nonce, tabindex (for example, any JavaScript should be removed).

The elements of MathML-core which need treatment by the sanitizers are the following:

  • annotation and annotation-xml if their encoding attribute is not considered of a safe type (e.g. if the encoding is text/plain then it could be kept). If removed, the element should be replaced by its first child.
  • maction is replaced by its first child
  • mphantom is removed
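
For illustration, the element part of this list can be written down as data. The two sets below are copied from the lists above; classifyMathMLElement and its three return values are hypothetical names, and the sketch deliberately says nothing about how an element that needs treatment is actually handled.

```javascript
// "Safe as-is" elements of MathML-core, as listed above.
const SAFE_MATHML_ELEMENTS = new Set([
  "math", "merror", "mfrac", "mi", "mmultiscripts", "mn", "mo", "mover",
  "mpadded", "mprescripts", "mroot", "mrow", "ms", "mspace", "msqrt",
  "mstyle", "msub", "msubsup", "msup", "mtable", "mtd", "mtext", "mtr",
  "munder", "munderover", "semantics",
]);

// Elements the list singles out for treatment by the sanitizer.
const NEEDS_TREATMENT = new Set([
  "annotation", "annotation-xml", "maction", "mphantom",
]);

function classifyMathMLElement(localName) {
  if (SAFE_MATHML_ELEMENTS.has(localName)) return "safe";
  if (NEEDS_TREATMENT.has(localName)) return "needs-treatment";
  return "not-listed";
}
```

Attributes would need a similar table, keyed by element for the element-specific ones.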

@annevk
Collaborator

annevk commented Jan 15, 2025

@polx thank you! Can you clarify what you mean by "replaced by their first child"? What happens if it contains multiple children? Is it literally what node.firstChild would return so a comment or whitespace-only text node could do?

@otherdaniel
Collaborator Author

Thanks for the list!

Added a preliminary PR at #250. (The PR is not very readable, since I've based it on another in-progress PR. The last commit contains the intended change.)

Some notes:

  • There is currently no mechanism in Sanitizer for "namespace-global" attributes, so the MathML-globals are merged with the HTML-globals.
  • There is currently no mechanism in Sanitizer for "replace by first child".
  • I'm guessing the munderover attributes should also exist on munder and mover (each)?

@polx thank you! Can you clarify what you mean by "replaced by their first child"? What happens if it contains multiple children? Is it literally what node.firstChild would return so a comment or whitespace-only text node could do?

I think the example at https://developer.mozilla.org/en-US/docs/Web/MathML/Element/maction#examples would explain this. Judging by the browser compatibility section, this isn't super well supported.

@annevk
Collaborator

annevk commented Jan 15, 2025

I guess that's first element child then, but yeah, that's not an operation we currently offer. Blocking is probably the most reasonable for now then.

@bkardell

bkardell commented Jan 15, 2025

Hmm, will it also block these?

<div hidden>Foo</div>

<div style="visibility: hidden">Bar</div>

DOMPurify doesn't. <mphantom> is basically like the latter one, with a default rule like that. I think the answer to whether you need to block it or not probably depends on how you answer the above? It doesn't seem more dangerous; if anything, probably less so than the latter one.

@otherdaniel
Collaborator Author

Hmm, will it also block these?

<div hidden>Foo</div>

<div style="visibility: hidden">Bar</div>

By default, yes. (We could change that of course; provided the group reaches consensus.)

Some background: The Sanitizer group has been going back and forth on this, but we effectively create three classes of markup: unsafe, default-allowed, allowable. We can be somewhat opinionated on what goes into the default-allowed group. As presently proposed in #244 + #250, .setHTML(...) (default usage) would allow neither <mphantom> nor style=, because neither the element nor the global style= attribute is in the default config. But .setHTML(..., {sanitizer: {}}) (not default, but empty configuration) would allow both of them. It wouldn't allow <script>, because that's in the "unsafe" set.
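
The three classes can be modelled roughly as follows. This is a toy model of the behaviour described above, not the algorithm from #244 or #250: isElementAllowed is a hypothetical function, the two sets are deliberately tiny placeholders, and attributes are left out.

```javascript
// Placeholder sets for illustration; the real lists live in the spec PRs.
const UNSAFE = new Set(["script"]);
const DEFAULT_ALLOWED = new Set(["p", "div"]);

// config === undefined models .setHTML(...) with no sanitizer option.
// config === {} models .setHTML(..., {sanitizer: {}}).
function isElementAllowed(name, config) {
  if (UNSAFE.has(name)) return false; // never allowed, whatever the config
  if (config === undefined) return DEFAULT_ALLOWED.has(name); // opinionated default
  if (config.allowElements === undefined) return true; // empty config: all allowable markup
  return config.allowElements.includes(name);
}
```

In this model, isElementAllowed("mphantom") is false, isElementAllowed("mphantom", {}) is true, and isElementAllowed("script", {}) is false.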

DOMPurify doesn't have an "opinionated" default in the sense I'm using the word here. The DOMPurify default is what you'd get with an empty config with Sanitizer API.
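The three classes described above can be sketched as a toy model. This is not the Sanitizer API itself: the function name `isElementAllowed` and the contents of both sets are assumptions made up for illustration, and only the two cases mentioned in the comment (no config, empty config) are modelled.

```javascript
// Toy model of "unsafe / default-allowed / allowable". NOT the real API.
// Both sets are illustrative assumptions, not the lists from #244 or #250.
const UNSAFE = new Set(["script"]);
const DEFAULT_ALLOWED = new Set(["div", "p", "math", "mrow", "mfrac"]);

// config === undefined models .setHTML(input): the opinionated default.
// config === {}        models .setHTML(input, {sanitizer: {}}): empty config.
function isElementAllowed(name, config) {
  if (UNSAFE.has(name)) return false; // never allowed, whatever the config
  if (config === undefined) return DEFAULT_ALLOWED.has(name);
  if (Object.keys(config).length === 0) return true; // allowable but not default
  throw new Error("non-empty configs are not modelled in this sketch");
}

console.log(isElementAllowed("mphantom"));     // false: not in the default set
console.log(isElementAllowed("mphantom", {})); // true: allowable via empty config
console.log(isElementAllowed("script", {}));   // false: unsafe
```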

@polx

polx commented Jan 15, 2025

I guess that's first element child then

Oooooh. Good catch. And... I made an error.

  • annotation and annotation-xml with non-trusted types: just remove, not replace by first child.
  • maction: these should be replaced by the first child element or, if only made of text, by the text (and other nodes should be discarded)
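A minimal sketch of the maction rule as stated in the list above, using plain objects rather than DOM nodes. The node shapes, the helper name `mactionReplacement`, and the choice to return `null` when there is neither an element child nor non-whitespace text are all assumptions for illustration; no such operation exists in the Sanitizer API today.

```javascript
// Toy node shapes (assumed for illustration, not DOM objects):
//   { type: "element", name: "mi", children: [...] }
//   { type: "text", data: "x" }
//   { type: "comment", data: "..." }
function mactionReplacement(maction) {
  // Prefer the first *element* child, skipping comments and text,
  // which is what distinguishes this from node.firstChild.
  const firstElement = maction.children.find((c) => c.type === "element");
  if (firstElement) return firstElement;

  // Otherwise fall back to the concatenated text; other nodes are discarded.
  const text = maction.children
    .filter((c) => c.type === "text")
    .map((c) => c.data)
    .join("");
  return text.trim() === "" ? null : { type: "text", data: text };
}
```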

Now. maction is rather a small use-case... it's on its way to deprecation in favour of JavaScript-based solutions.
And removing is ok for the other elements that should not be kept.

I guess we should assemble this into a document. Should it be inserted in MathML-Core (not sure we still can)? Just a working-group note? So far, it only lives as individual documents on some people's computers.

@annevk
Collaborator

annevk commented Jan 16, 2025

It'll be documented here (@otherdaniel created #250) and then upstreamed into the HTML standard. For now we'll keep the list centralized. At some future point we might consider reorganizing that, once we're more comfortable with how updates go and such.

@polx

polx commented Jan 16, 2025

I have created mathml-safe-list to track the evolution of this document and to consider it for further inclusion.

There was the error that the annotation and annotation-xml elements should not be replaced but removed. This was corrected.
A clarification on what it meant to replace by the first child was made.

Both are done in the commit 1eb208 of the mathml-doc repository.

I understand that PR #250 honours this understanding (and ignores the possible replacement of maction by a child, which is not a major problem at all). So this is good news. I will notify here once we have the confirmation of the working group.

@fred-wang

mphantom elements (the element can be replaced by their first child)

That does not work. mphantom can have any number of children; it renders as mrow + visibility: hidden. If you only keep the first child, you are effectively changing the layout of the element and making the content visible.

@polx

polx commented Jan 17, 2025

Hello all,
Thanks @fred-wang. We discussed the topic recently in the MathML working group's meeting (earlier than the group focussed on MathML-core), and everyone flagged the removal of mphantom as a problem: it would break layout in multiple ways.

A simple example of mphantom is a fraction with an empty numerator (the part above the bar): this makes sense for layout, or as a hint that something will come here. To help layout, mphantom takes up the space of something, and that something is the content of the mphantom.

While it is true that the content is never shown unless you edit the content, it should be considered safe if we further process the content of MathML according to the sanitizer's behaviour.

I have made changes (1, 2) to the proposed text removing the removal of mphantom and am requesting to add mphantom among the safe elements. I think we can agree on this.

@bkardell

bkardell commented Jan 17, 2025

note that #250 needs to be updated if so, but I'm not sure that we have agreement on that since, as @otherdaniel said it seems that similar things in HTML would be removed but could be easily controlled by the site (it seems by just passing an empty config)?

otherdaniel added a commit to otherdaniel/purification that referenced this issue Jan 17, 2025
@annevk
Collaborator

annevk commented Jan 18, 2025

Right, we don't want to allow hidden things.

otherdaniel added a commit that referenced this issue Jan 22, 2025
Defaults for MathML, based chiefly on https://w3c.github.io/mathml-docs/mathml-safe-list and discussion in #103.
@polx

polx commented Jan 23, 2025

I am trying to find out more what is possible and thinkable:

  • replacing mphantom with an mrow carrying a CSS that makes it invisible would be useful (but probably not desired)
  • replacing mphantom and its children with an mspace might be the least breaking change as some MathML elements have positional children (e.g. the mroot or mfrac elements for which removing a child would exchange things, e.g. moving the denominator to the numerator)
  • removing the children of mphantom but keeping the element itself would also not break things

Which way would be possible and thinkable?

This is important as mphantom is used to a non-negligible extent. E.g., in the world of arXiv, mphantom is found in about 40k formulas, about 1% of the formulas (see this report).

@annevk
Collaborator

annevk commented Jan 23, 2025

There are basically two options:

  • Replace with children.
  • Remove.

@polx

polx commented Jan 24, 2025

Is changing an element not possible?
My idea would be to recommend replacing <mphantom> by <mrow> and </mphantom> by </mrow>.

Both replace-with-children and remove will break some expressions.
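To make the breakage concrete, here is a rough sketch on a toy tree (plain objects, not DOM) comparing the two available options with the renaming idea. The helper names `el`, `transform` and `childNames`, and the strategy strings, are hypothetical; renaming is not an operation the Sanitizer API offers. The point is only that mfrac expects exactly two positional children.

```javascript
// Toy tree: { name, children }. Not DOM; all helpers here are hypothetical.
const el = (name, ...children) => ({ name, children });

// Apply one strategy to every mphantom in the tree and return a new tree.
function transform(node, strategy) {
  const children = node.children.flatMap((child) => {
    const t = transform(child, strategy);
    if (child.name !== "mphantom") return [t];
    if (strategy === "remove") return [];
    if (strategy === "replace-with-children") return t.children;
    if (strategy === "rename-to-mrow") return [{ ...t, name: "mrow" }];
    throw new Error(`unknown strategy: ${strategy}`);
  });
  return { name: node.name, children };
}

// <mfrac><mphantom><mi/><mo/><mi/></mphantom><mn/></mfrac>
const frac = el("mfrac", el("mphantom", el("mi"), el("mo"), el("mi")), el("mn"));
const childNames = (n) => n.children.map((c) => c.name);

// One child left: the denominator slides into the numerator slot.
console.log(childNames(transform(frac, "remove")));
// Four children: mfrac no longer has exactly two.
console.log(childNames(transform(frac, "replace-with-children")));
// Two children: positions preserved, but the content is now visible.
console.log(childNames(transform(frac, "rename-to-mrow")));
```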

@bkardell

bkardell commented Jan 24, 2025

@polx they can still pass an empty config and <mphantom> will come through (along with other 'hidden' things I guess) or specify a somewhat better config

@polx

polx commented Jan 25, 2025

Yes, I am aware of that. But this is not what the "default" users of the sanitizer API do, I think.

I would be happy to be wrong!
