Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure sanitizing algorithms can be implemented efficiently #259

Open
smaug---- opened this issue Feb 24, 2025 · 6 comments
Open

Ensure sanitizing algorithms can be implemented efficiently #259

smaug---- opened this issue Feb 24, 2025 · 6 comments
Assignees

Comments

@smaug----
Copy link

This is a rather vague issue, but it isn't immediately clear how much overhead sanitizing algorithms will add to normal parsing.
innerHTML setter implementations are highly optimized in browsers, and would be good if setHTML could be too (as much as possible).
I wouldn't be surprised if setHTML implementations are first a bit less optimized, but hopefully they could get better over time, assuming the spec algorithms allow that.

@hsivonen, @mozfreddyb

@mozfreddyb
Copy link
Collaborator

So far, we have mostly treated that as an implementation detail rather than a result of how the spec is written.
Are there any specific improvements that need to be taken into consideration when specifying?
If so, we'd be happy to take them into considerations.

@smaug----
Copy link
Author

smaug---- commented Feb 28, 2025

I'm mostly thinking that can the algorithms work while parsing, or is there something which requires them to run on the result document fragment. (the latter being possibly slow)

@annevk
Copy link
Collaborator

annevk commented Feb 28, 2025

I think "replace" might be tricky to implement during parsing. Imagine <table><div><td> as input with "replace" set to table elements. We currently require the div to end up as a sibling to the tbody element (which contains a tr element which contains the td element). And I don't think we'd want it any other way. (We should add a test for this.)

It's probably not impossible though and there's nothing in sanitization that depends on things that are later in the tree, it's just that the HTML parser sometimes creates trees in weird ways.

@annevk
Copy link
Collaborator

annevk commented Feb 28, 2025

The above also made me realize that the act of sanitization can result in mutation XSS as sanitization is mostly blind to the realities of HTML syntax. I think that's okay though. If you're going to replace a table element you are asking for a bad time and it's really hard to stop people from asking for a bad time if they want one.

@otherdaniel
Copy link
Collaborator

For our past implementation, I had measured sanitization as being much faster than parsing so that the total added time was fairly low. (Assuming the config has already been processed. The current API makes it fairly easy to re-process the config every time.) The implementation wasn't terribly sophisticated, only pre-processing the config into hash tables / hash sets, and using the browser's existing internal abstraction for QNames (which are fast to compare).

The new implementation isn't ready yet, but I expect much the same.

I think an incremental implementation within the HTML parser should be possible: We currently use an in-order traversal of the resulting DOM tree, but all decisions regarding any specific DOM node are made when reaching that node. It should be possible to convert that into an incremental algorithm where that same decision is made just when that particular node has finished parsing. That'd likely increase the maintenance burden by a good bit, though. My impression is that the HTML parser is already fairly complicated, and additional complexity will require an argument.

@mozfreddyb
Copy link
Collaborator

We discussed this as a group and want to keep this as the current architecture where parser & sanitizer are separate issues. @annevk has agreed to submit a test-case to the sanitizer-api tests in wpt such that tree construction oddities (the <table><div><td> example) are working as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants