-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposed API: MathML, SVG support, plus localname case-handling. #103
Comments
After thinking about it for a little while, I see some problems with using
Furthermore, I am not sure that any namespace should be the default. The last few bypasses of DOMPurify stemmed from the fact that some elements were created in unexpected namespace (like |
Thanks, this is good feedback!
|
I think It might be useful to be able to block |
Agreed on "full" namespaces. My argument was just about the separator that we use. I saw some tests in web platform that used space as a separator (for instance:
Yeah, I'm not that entirely sure on importance of this one. My argument was that if you make a typo in the namespace, for instance:
My proposal is to:
|
Looks like we've let this linger for a while... let's have a new go at it. After reading the feedback here and there, and discussing this with @mozfreddyb, I'd like to - for now - prefer simplicity over expressiveness, so that can start with a simple proposal and extend it as concrete use cases emerge. The cornerstones are:
================================ This would be the proposal:
|
To me, it seems simpler to drop the "html" prefix, which would ensure a 1:1 relationship in all cases and resolve It would be good to see specified use cases or other reasons for retaining the "html" prefix, which may suggest answers to the indicated alternatives quoted above. |
Given the indeterminate lifetime of HTML a new namespace designator may eventually be needed, so it may be more backwards compatible (e.g. newer code in an older browser) to drop rather than throw. I expect static analysis tools or runtime validation libraries will be developed if invalid configuration becomes a problem such that people want to verify. However, there might be security reasons to throw rather than drop. |
That's great, thanks for writing that up.
I don't like the idea of adding lots of boolean settings. Instead, I suggest we provide static constants on the sanitizer that allow building and combining lists. We can bikeshed on the names, I don't have strong feelings, but something along the lines of |
My intuition was that if everyhing else has a prefix then so should HTML, but also that requiring an HTML prefix is an awful lot of extra typing. Admittedly, that's a rather weak argument, and dropping the
This is very true. We should drop (rather than throw). Especially since
Yes, that'd also work. I wonder how many 'real' use cases there are for this. I'm guessing HTML-only and everything are common, which can be easily covered by presets. |
I'm currently looking at how to spec this. I initially thought I'd go with whitespace as separator, as discussed above, but https://html.spec.whatwg.org/multipage/syntax.html#attributes-2 already specifies a character-by-character representation of all namespaced attributes allowed in HTML, in the table at the end of the subsection. And that uses a colon. I suspect that re-inventing our own representation is not a good idea then. In either case, I'll update the issue when I have a review-quality PR ready. |
Hello again, I've now uploaded PR #137, which drafts supports for SVG and MathML. I'm not super happy with the result, so it'd be fantastic if people could offer some opinions on whether this is going in the right directions, and how to improve it. In particular:
|
Hi @otherdaniel, I've just read the spec draft and first let me thank you for your work because I think this is something much needed for the platform ! Now WRT namespaces, there is an existing syntax that could be reused here: CSS type selectors. The advantages of using this syntax would be:
The following example would block const sanitizer = new Sanitizer({
defaultNamespace: document.documentElement.namespaceURI,
namespaces: {
foo: 'urn:foo',
},
blockElements: ['foo|bar', '*|baz', '|qux', 'gizmo'],
}) If the syntax were to be defined in terms of CSS type selectors, this would allow to gradually introduce support for other CSS selectors into the API, so that things like this would become possible: const sanitizer = new Sanitizer({
dropElements: [
'a[target="_blank"]',
'iframe:not([src^="https://www.youtube.com"])',
],
}) What do you think ? |
I think using a subset of the CSS selector syntax is a really interesting idea. My main concern with supporting a large subset would be the serializing rules. But many people are using CSS selectors with DOM APIs and things generally seem to work well (although greater use of CSS.escape would probably help). In any case, I would be happy using |
@Andrew-Cottrell I don't get why CSS serialization rules would be an issue here. Could you elaborate on this? |
const badSanitizer = new Sanitizer({
dropElements: [
'div.123abc' // incorrectly serialized, but an easy mistake to make
]
});
const goodSanitizer = new Sanitizer({
dropElements: [
'div.\\31 23abc' // correctly serialized, could also use: 'div.' + CSS.escape('123abc')
]
}); This probably isn't a serious problem in practice, but I'm slightly concerned with the ease of this mistake in a security context. |
Ah I see, indeed the spec would need to define what to do in case of an invalid selector. Whether to throw a Currently, the spec just says to removes element names that were normalized to null from the allow lists, so the same should probably be done in case of an invalid selector... |
triage: Let's keep this one open to ensure we have alignment on a v1 list of allowed elements (html, svg, mathml) |
Should be moot with #208 landed. |
Do we have an SVG and MathML safelist? Is that tracked anywhere else? That's the main thing I can still see missing. |
It seems that MathML isn't even mentioned in the spec currently? SVG is here, but that mention only points to the SVG namespace (which is right below the MathML namespace in that doc :)). Is there a reason it wasn't included? |
There's the extremes "known to be harmful" and "known and extensively security tested (by browser vendors and external testers) to be safe and harmless". But there's a large area in the middle, including complex features that have not been extensively tested for security yet. To give an ill-fitting example: WebGL is a complex feature that I would definitely not allow through a sanitizer. Even though it's not "known to be bad", like direct RAM access by the GFX card. Likewise, if a MathML feature cannot be proven to be practically impossible to cause security holes (including buffer overflows within the render engine), then it should be disabled in a sanitizer. If MathML hasn't received extensive security testing yet, it's better to not allow it by default. |
While I agree that some capabilities are risky to expose to the web (like e.g., graphics APIs), I don't think it is the sanitizer's job to control or get in the way of non-declarative APIs. WebGL is not a markup feature. This is about elements and attributes - features of a document that may contain user-supplied content. The web is beautiful because it is powerful. I don't think we or anyone else should play feature-police of what is considered risky and what is not. Specifically, because you can't prove a negative but also generally because I don't think we should care about implementation bugs. If a browser has a bug, then it should fix it 😉 |
For security-sensitive use cases, the whole purpose of the sanitizer is to allow only what is proven to be secure ("secure" meaning extremely unlikely to have a security hole in the next years). I understand that's different in approach from only "remove what is known to be harmful". I am making that difference explicit, so that both needs can be met, possibly with different profiles or options in the config. Security-conscious use cases need a profile where unproven features are not enabled. The sanitizer is a "feature police" by its very nature. (And FWIW, any remote code - like JavaScript or WebGL, sandboxed or sanitized or not - definitely has absolutely no business in sanitized HTML code.) |
That's the thing: When I created the sanitizer in Firefox ca. 25 years ago, whole purpose of sanitizing HTML was to protect against browser bugs ;-) . There are dozens of remote code execution security holes in every browser, every month. That just isn't good enough, for some situations. I.e. there are factor 100 to 1000 too many of those browser bugs, for many use cases. The sanitizer is the way to deal with that, without resorting to plaintext. To get back to topic here: There should be
|
@polx thanks! Here's the thinking from the group on this:
@benbucksch accounting for browser bugs is not realistic. There might well be bugs in the sanitizer, a browser could start executing the contents of a |
By removing JavaScript, I kill 90% of browser bugs, and when I disable |
We're working on it. While writing, I noticed that one element allows to "hide content" in it (this is for layout purposes and allows subtle layouts to be built). Should we consider this feature as a "bad feature" (like a white character on a white background) that the sanitizer should aim at removing? |
@polx given that we want to address styling-based attacks, I'd think so. You could keep it in a separate list for closer review down the line perhaps? |
As a prospective user, I'm planning to use a strict whitelist, and I imagine all security-conscious engineers will do the same. So it doesn't matter to me if these potentially-risky elements are excluded by default or not, I'm excluding them myself. |
Here is a proposed version. We should discuss and decide on this list on the next MathML-core meeting at the end of January 2025. Comments are very welcome. MathML Safe ListShort VersionMathML-core considers all elements and attributes of MathML-core (as listed in section 2.1 of MathML-core) as safe and not needing a sanitziation except the following elements. We recommend the Sanitzer API to sanitize MathML by keeping all elements and attributes except the follwing:
Detailed VersionMathML-core considers the following elements and attributes of MathML-core as safe and not needing sanitization: Safe "as-is" Elements of MathML-core: Attributes of MathML-core: Moreover, the following attributes have their syntax and semantics specified in the HTML specification. The sanitizer behaviour on these attributes should be as is done on HTML elements: The elements of MathML-core which need treatment by the sanitizers are the following:
|
@polx thank you! Can you clarify what you mean by "replaced by their first child"? What happens if it contains multiple children? Is it literally what |
Thanks for the list! Added a preliminary PR at #250. (The PR is not very readable, since I've based it on an another in-progress PR. The last commit contains the intended change. Some notes:
I think the example at https://developer.mozilla.org/en-US/docs/Web/MathML/Element/maction#examples would explain this. Judging by the browser compatibility section, this isn't super well supported. |
I guess that's first element child then, but yeah, that's not an operation we currently offer. Blocking is probably the most reasonable for now then. |
Hmm, will it also block these?
DOMPurifer doesn't. |
By default, yes. (We could change that of course; provided the group reaches consensus.) Some background: The Sanitizer group has been going back and forth on this, but we effectively create three classes of markup: unsafe, default-allowed, allowable. We can be somewhat opinionated on what goes into the default-allowed group. As presently proposed in #244 + #250, DOMPurify doesn't have an "opinionated" default in the sense I'm using the word here. The DOMPurify default is what you'd get with an empty config with Sanitizer API. |
Oooooh. Good catch. And... I made an error.
Now. I guess we should assemble this into a document. Should it be inserted in MathML-Core (not sure we still can)? Just a working-group-note? So far, it's only living as individual documents at some persons' computer. |
It'll be documented here (@otherdaniel created #250) and then upstreamed into the HTML standard. For now we'll keep the list centralized. At some point future point we might consider reorganizing that, once we're more comfortable with how updates go and such. |
I have created mathml-safe-list to track the evolution of this document and to consider for furhter inclusion. There was the error that the Both are done in the commit 1eb208 of the mathml-doc repository. I seem to understand that the PR #250 seems to honour this understanding (and ignore the possible replacement by a child of |
That does not work. mphantom can have any number of elements ; it renders as mrow + visibility: hidden. If you only keep the first child, you are effectively changing the layout of the element and making the content visible. |
Hello all, A simple example of While it is true that the content is never shown unless you edit the content, it should be considered safe if we further process the content of MathML according to the sanitizer's behaviour. I have made changes (1, 2) to the proposed text removing the removal of |
note that #250 needs to be updated if so, but I'm not sure that we have agreement on that since, as @otherdaniel said it seems that similar things in HTML would be removed but could be easily controlled by the site (it seems by just passing an empty config)? |
Right, we don't want to allow hidden things. |
Defaults for MathML, based chiefly on https://w3c.github.io/mathml-docs/mathml-safe-list and discussion in #103.
I am trying to find out more what is possible and thinkable:
Which way would be possible and thinkable? This is important as |
There's basically two options:
|
Is changing an element not possible? Both replace-with-children and remove will break some expressions. |
@polx they can still pass an empty config and |
Yes, I am aware of that. But this is not what the "default" users of the sanitizer API do, I think. I would be happy to be wrong! |
Seperate issue, because this tries to pull together several existing issues in one.
javascript:
bypass via<svg>
anduse
. #84 + Bypass viaSVGAnimationElement.onend
. #85).The current problems are:
This mainly picks up SecurityMB's comment in #72, and tries to extend it to cover anything.
"html:name"
,"svg:name"
,"math:name"
, or"*:name"
. The latter can also be written as"name"
. That is, any namespace is the default.*:
variant matches any element with the given localName, thehtml:
,svg:
,math:
variants match against the localname, iff the parser has placed the elements into the corresponding namespace. (Since now all methods that (implicitly) parse have an explicit context, this ought to work with a rather low surprise factor.)html:
and*:
variants match case in-sensitively. The config returns them lower-cased. The other two match case-sensitively and the config getter returns their names as-is.*:
names without the prefix. (I.e.,"*.script"
becomes"script"
.)"wondernamespace:script"
.)WDYT?
I find the case-handling based on the namespace to be slightly surprising, but this matches what HTML parser / DOM does. If we embrace that, I think we the rest of the API is fairly straight-forward.
The text was updated successfully, but these errors were encountered: