Nokogiri should use Gumbo/HTML5 by default on supported platforms #2331
Comments
Moving this out one release to make sure the CSS query hacking is stable.
html5 subclassing

**What problem is this PR intended to solve?**

See #2331 for context. I want to start getting things in place to make it possible to seamlessly switch to HTML5 parsing by default on supported platforms. Part of this will require subclassing behavior to work properly (i.e., as Loofah expects it to, where a subclass of `Nokogiri::HTML5::Document` will return the appropriate subclass from `.parse`). This PR introduces that subclassing behavior, and makes all the HTML4 tests explicitly use `HTML4` instead of `HTML`.

Note that `Gumbo.parse` now takes an additional argument, which is the class that should be used for the new document. `Gumbo.parse` is considered to be an internal-only API and so this shouldn't be a breaking change, but it might be worth mentioning in release notes just in case.

**Have you included adequate test coverage?**

Yes, additional coverage has been added to `test/html5/test_api.rb`.

**Does this change affect the behavior of either the C or the Java implementations?**

HTML5 only exists in the CRuby implementation.
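A minimal sketch of the subclassing behavior described in that PR, using toy stand-ins rather than Nokogiri itself. `ToyParser`, `ToyDocument`, and `SanitizerDocument` are hypothetical names invented for illustration; the only idea borrowed from the PR text is that the low-level parse entry point receives the document class to instantiate, so a subclass calling `.parse` gets an instance of itself back.

```ruby
# Toy stand-ins only; none of these names are Nokogiri's real internals.
module ToyParser
  # Mirrors the idea of handing the target document class to the low-level parser.
  def self.parse(input, document_class)
    document_class.new(input)
  end
end

class ToyDocument
  attr_reader :source

  def initialize(source)
    @source = source
  end

  # `self` is the receiving class, so a subclass passes itself along.
  def self.parse(input)
    ToyParser.parse(input, self)
  end
end

# Hypothetical downstream subclass, in the spirit of what Loofah would define.
class SanitizerDocument < ToyDocument; end

doc = SanitizerDocument.parse("<p>hello</p>")
puts doc.class # prints "SanitizerDocument"
```

Passing the class explicitly keeps the low-level parser free of any knowledge about downstream subclasses, which is presumably why an extra argument was the least invasive change.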
A baby step we'll do first is: support subclassing in v1.14.0 to enable Loofah and Rails::Html::Sanitizer to default to HTML5:
So I'm pushing this out to v1.15.0
See #2569 for one performance concern that we should benchmark and address. Update: it's been addressed.
It may be worth testing upstream Capybara before releasing this change, as Capybara has some logic for toggling between HTML4 and HTML5 parsers already.
Tested HTML: big_shopping.html.zip
We should profile and figure out where it is spending the majority of its time. One thing that could probably be improved is turning the part of the state machine that reads characters into a buffer into a loop rather than needing to do the state machine dispatch per character. This potentially includes reading tag names and just reading text. The state machine is complicated though and we’d need to look at the tests closely to make sure they adequately cover such a change.
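To make the per-character versus batched distinction concrete, here is a rough sketch in Ruby. It is illustrative only: Gumbo's tokenizer is written in C and has many more states, and the two-state grammar (text and `<...>` tags) and the function names below are assumptions made up for this example. The first function goes through the state dispatch once per character; the second lets each state consume a whole run of characters in one step.

```ruby
require "strscan"

# Per-character version: one trip through the state dispatch for every character.
def tokenize_per_char(input)
  tokens = []
  state = :data
  buffer = +""
  input.each_char do |ch|
    case state
    when :data
      if ch == "<"
        tokens << [:text, buffer] unless buffer.empty?
        buffer = +""
        state = :tag
      else
        buffer << ch
      end
    when :tag
      if ch == ">"
        tokens << [:tag, buffer]
        buffer = +""
        state = :data
      else
        buffer << ch
      end
    end
  end
  # Trailing text is emitted; an unterminated tag is dropped.
  tokens << [:text, buffer] if state == :data && !buffer.empty?
  tokens
end

# Batched version: each state consumes a whole run of characters at once.
def tokenize_batched(input)
  tokens = []
  scanner = StringScanner.new(input)
  until scanner.eos?
    if scanner.skip(/</)
      name = scanner.scan(/[^>]*/)
      # Only emit the tag if its closing ">" is present.
      tokens << [:tag, name] if scanner.skip(/>/)
    else
      tokens << [:text, scanner.scan(/[^<]+/)]
    end
  end
  tokens
end

sample = "<p>hi</p>bye"
p tokenize_per_char(sample) == tokenize_batched(sample) # prints true
```

Comparing the two outputs on the same inputs, as the last line does, is also the kind of check the existing tests would need to provide before such a change could be trusted.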
@stevecheckoway here's some profiling info on what the parser is doing when parsing
I won't attempt to analyze that today, but the amount of time spent in
Some of that is quite odd. The fact that I just looked at the assembly expecting that the code would essentially be a few loads and a store. This is the only line of the function that does anything:
parser->_tokenizer_state->_is_adjusted_current_node_foreign = is_foreign;
The assembly that is being emitted contains calls to the empty
diff --git a/gumbo-parser/src/util.c b/gumbo-parser/src/util.c
index d1ab2d7a..6238c296 100644
--- a/gumbo-parser/src/util.c
+++ b/gumbo-parser/src/util.c
@@ -63,6 +63,4 @@ void gumbo_debug(const char* format, ...) {
va_end(args);
fflush(stdout);
}
-#else
-void gumbo_debug(const char* UNUSED_ARG(format), ...) {}
#endif
diff --git a/gumbo-parser/src/util.h b/gumbo-parser/src/util.h
index dfdf465b..5c6ddd8c 100644
--- a/gumbo-parser/src/util.h
+++ b/gumbo-parser/src/util.h
@@ -21,7 +21,11 @@ void* gumbo_realloc(void* ptr, size_t size) RETURNS_NONNULL;
void gumbo_free(void* ptr);
// Debug wrapper for printf
+#ifdef GUMBO_DEBUG
void gumbo_debug(const char* format, ...) PRINTF(1);
+#else
+static inline void gumbo_debug(const char* UNUSED_ARG(format), ...) PRINTF(1) {}
+#endif
#ifdef __cplusplus
}
After that change, the emitted code is
But even that could be inlined with link-time optimization. (In fact, doing a link-time optimization and also explicitly setting the symbols that are exported from the Those two seem like easy wins although setting which symbols are exported may impact downstream projects that rely on the Nokogiri C extension's symbols (like Nokogumbo did). Unfortunately, the 2-pass procedure where we first build a DOM tree using Gumbo's data structures and then build a DOM tree using libxml2's data structures does mean we have some essentially unavoidable overhead. Maybe gumbo could be modified to build a libxml2 DOM itself. That's likely a significant undertaking.
I'm going to open up a new issue to dive into some of these optimizations, since this issue is related but not specific to performance. Let's move the performance conversation to #2722.
Setting this to the v2.0 milestone since it's likely to be a breaking change for some folks.
Nokogiri is on the path to parsing with HTML5 by default: sparklemotion/nokogiri#2331. But there are some things they still need to do. For those of us who want to opt in to HTML5 parsing, I've added an option for it. This will prevent the gem from messing with the structure of the HTML (specifically, prematurely closing <a> tags that wrapped table elements).
* Asserted on broken behaviour with HTML4 parsing. This commit just captures existing behaviour because in the next commit I'm going to make this configurable.
* Allow parsing with HTML5. Nokogiri is on the path to parsing with HTML5 by default: sparklemotion/nokogiri#2331. But there are some things they still need to do. For those of us who want to opt in to HTML5 parsing, I've added an option for it. This will prevent the gem from messing with the structure of the HTML (specifically, prematurely closing <a> tags that wrapped table elements).
This issue is a placeholder for work to be done to use the HTML5 parsing engine by default on platforms where it's supported (meaning, not-JRuby).
Specifically this probably means that when the `HTML5` module exists ...
- `Nokogiri::HTML()` should proxy to `Nokogiri::HTML5()`
- `Nokogiri.parse()` should proxy to `Nokogiri::HTML5.parse()`
- `bin/nokogiri` should support `html4` and `html5` options, and the existing `html` option should proxy to `html5`
There may be other behaviors we want to switch to the HTML5 parser as well.
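The proxying idea can be sketched with a toy module. This is not Nokogiri's implementation: `ToyGem` and `default_html_parser` are hypothetical names, and the stand-in parsers just tag their input so the dispatch is visible. The assumption being illustrated is only that the generic `HTML` entry point picks the HTML5 parser when that module is defined and falls back to HTML4 otherwise (as it would on JRuby).

```ruby
# Hypothetical stand-in for the gem's namespace; not Nokogiri's actual code.
module ToyGem
  module HTML4
    def self.parse(input)
      [:html4, input]
    end
  end

  # Removing this module simulates a platform without HTML5 support.
  module HTML5
    def self.parse(input)
      [:html5, input]
    end
  end

  # Chooses the parser module based on whether HTML5 is defined here.
  def self.default_html_parser
    const_defined?(:HTML5, false) ? HTML5 : HTML4
  end

  # Generic entry point that proxies to whichever parser is the default.
  def self.HTML(input)
    default_html_parser.parse(input)
  end
end

p ToyGem.HTML("<p>hi</p>") # prints [:html5, "<p>hi</p>"]
```

Checking for the constant at call time rather than at load time keeps the fallback in one place, at the cost of a small lookup on every call.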
Let's also please make sure to do some benchmarks before changing the default behavior. In particular this would be document and fragment parsing, as well as any CSS selectors that are conditionally translated (see #2376).
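One possible shape for such a benchmark, using only Ruby's standard `Benchmark` module. The helper name `compare_parsers` and the placeholder callables are assumptions for illustration; in a real run the callables would wrap the actual HTML4 and HTML5 parse calls, and the input would be a representative document or fragment.

```ruby
require "benchmark"

# Times each named callable over the same input; returns seconds per label.
def compare_parsers(parsers, input, iterations: 100)
  parsers.to_h do |label, parser|
    seconds = Benchmark.realtime { iterations.times { parser.call(input) } }
    [label, seconds]
  end
end

html = "<table><tr><td>cell</td></tr></table>" * 1_000

# Placeholder callables so the sketch runs without any gems installed.
# With Nokogiri loaded, these could instead be something like
#   ->(s) { Nokogiri::HTML4::Document.parse(s) }
#   ->(s) { Nokogiri::HTML5::Document.parse(s) }
parsers = {
  "stand-in A" => ->(s) { s.scan(/<[^>]+>/) },
  "stand-in B" => ->(s) { s.split(/(?=<)/) },
}

compare_parsers(parsers, html, iterations: 10).each do |label, seconds|
  puts format("%-12s %.4fs", label, seconds)
end
```

The same harness could be pointed at fragment parsing or CSS selector queries by swapping the callables.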
Pre-work: