Using ZeroTrie for property parser #5576
e9be7c2 to 0ddcc6d
As mentioned this morning, I'm not a huge fan of the recursion, and I'm a bit wary that it can cause stack overflows with malicious data. Perhaps we can apply a recursion limit, probably by limiting the length of the input string to something reasonable.
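To make the suggestion concrete, here is a minimal sketch of a length guard in front of a recursive lookup. `MAX_NAME_LEN`, `lookup_unbounded`, and `lookup_bounded` are hypothetical names, the cap of 256 is an arbitrary placeholder, and the recursive body is only a stand-in for the trie walk in this PR.

```rust
/// Hypothetical cap; the thread does not propose a specific value.
const MAX_NAME_LEN: usize = 256;

/// Stand-in for the recursive lookup: recursion depth grows with the input length.
fn lookup_unbounded(name: &[u8]) -> Option<usize> {
    match name.split_first() {
        None => Some(0),
        Some((_, rest)) => lookup_unbounded(rest).map(|n| n + 1),
    }
}

/// Rejects overlong input before any recursion happens,
/// which bounds the recursion depth by MAX_NAME_LEN.
fn lookup_bounded(name: &[u8]) -> Option<usize> {
    if name.len() > MAX_NAME_LEN {
        return None;
    }
    lookup_unbounded(name)
}
```

Checking the length once at the entry point keeps the recursive function itself unchanged, at the cost of rejecting any legitimate name longer than the cap.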
components/properties/src/names.rs
Outdated
skip_cursor.step(skip);
if let Some(r) = recurse(skip_cursor, name) {
    return Some(r);
}
Nit: It's probably faster to check for None in the return value of .step in order to avoid the extra recursion call. In most cases the step will be None.
.step does not have a return value though. Moving the empty check in front of every recursive call complicates the logic and I'm not convinced there is much of a cost.
Oh, it's only ZeroAsciiIgnoreCaseTrie that returns a value in .step. I would approve a two-line change to make ZeroTrieSimpleAsciiCursor also return a value in .step.
It could return Option<u8>, Option<()>, or bool.
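As an illustration of what a value-returning step buys, here is a sketch using a toy cursor over a word list. `ToyCursor` and `try_skip` are hypothetical and are not the zerotrie API; the only point is that a `step` returning `Option<u8>` lets the caller drop a dead branch without another call.

```rust
/// Toy cursor over a word list; a stand-in for a trie cursor, not the zerotrie API.
#[derive(Clone)]
struct ToyCursor<'a> {
    candidates: Vec<&'a str>,
    depth: usize,
}

impl<'a> ToyCursor<'a> {
    fn new(words: &[&'a str]) -> Self {
        Self { candidates: words.to_vec(), depth: 0 }
    }

    /// Advances by one byte. Returns Some(byte) if any word continues
    /// with that byte, or None if the cursor is now empty.
    fn step(&mut self, byte: u8) -> Option<u8> {
        let depth = self.depth;
        self.candidates.retain(|w| w.as_bytes().get(depth) == Some(&byte));
        self.depth += 1;
        if self.candidates.is_empty() { None } else { Some(byte) }
    }

    /// True if some word ends exactly at the current position.
    fn is_match(&self) -> bool {
        self.candidates.iter().any(|w| w.len() == self.depth)
    }
}

/// Tries to skip `skip` in the stored data, then match `rest`.
/// The return value of `step` ends the dead branch early.
fn try_skip(cursor: &ToyCursor, skip: u8, rest: &[u8]) -> bool {
    let mut skip_cursor = cursor.clone();
    if skip_cursor.step(skip).is_none() {
        return false; // nothing in the data continues with `skip`
    }
    rest.iter().all(|&b| skip_cursor.step(b).is_some()) && skip_cursor.is_match()
}
```

Of the three return types floated above, `Option<u8>` is used here only because it is the first one listed; `bool` would serve this early-exit check equally well.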
}

// Skip whitespace, underscore, hyphen in trie.
for skip in [b'\t', b'\n', b'\x0C', b'\r', b' ', 0x0B, b'_', b'-'] {
Question: do these characters like \t and \n actually occur in the trie? Seems unlikely?
Not in today's compiled data, but according to the spec this is what we have to do.
Checking the cursor for these characters is fairly cheap, and recursion only happens if a character is actually found.
Citation in the spec?
You say "fairly cheap", but this is sort-of a hot path (for example, regex and unicode set parsing), and every one of these requires a function call, and function calls are not free. My guess is that these extra function calls together make the function about 2x as slow. I could be wrong.
I also find it incredibly unlikely that the UCD would add a canonical property name containing characters like \t and \n. I understand skipping those in the user's string, but not in the trie string.
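For reference, the matching rule under discussion can be sketched as a plain linear comparison, without any trie. `is_ignorable`, `normalized`, and `loose_eq` are hypothetical helpers; the byte set is copied from the loop in the diff, and stripping both sides symmetrically is a simplification that sidesteps the question of which bytes need to be skipped on the trie side.

```rust
/// Bytes skipped by the loop under review: ASCII whitespace, underscore, hyphen.
fn is_ignorable(b: u8) -> bool {
    matches!(b, b'\t' | b'\n' | 0x0B | 0x0C | b'\r' | b' ' | b'_' | b'-')
}

/// Yields the bytes of `s` with ignorable bytes dropped and ASCII case folded.
fn normalized(s: &[u8]) -> impl Iterator<Item = u8> + '_ {
    s.iter()
        .copied()
        .filter(|&c| !is_ignorable(c))
        .map(|c| c.to_ascii_lowercase())
}

/// Hypothetical reference comparison: two names match if they are
/// equal after normalization on both sides.
fn loose_eq(a: &[u8], b: &[u8]) -> bool {
    normalized(a).eq(normalized(b))
}
```

A function like this could serve as a test oracle for the trie-based lookup: whatever the cursor logic does, it should agree with the linear comparison on every stored name.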
I have concerns about the recurse function, but they can be addressed in a follow-up. I think overall this is the right direction, and we can make it faster.
Fixes #4861