AsmParser: parse source comment using scanner instead of regex #15209
base: develop
Just some nitpicks and small suggestions.
This needs a changelog entry for the bugfix.
Also note that GitHub is a bit picky about the `fixes`/`closes` annotation. `fixes issue #15207` did not work: the PR is not connected to the issue and won't close it automatically. You need `Fixes #15207`, i.e. without anything in between the word and the issue number.
Also, just to confirm/point to, I see that we have some tests in `test/libyul/Parser.cpp` which seem to cover parsing. Not sure if we need or would benefit from adding something more there.
The repro is a bit large and otherwise trivial, so I think it's ok to skip it.
I think we're missing some coverage for whitespace and newline handling behavior, because the PR passes all tests even though the way it handles whitespace seems to have changed.
I guess you could now also remove the |
Looks like it worked. Which is actually odd, because this PR doesn't fix #13496, does it? I mean, if it does then great, maybe we should close it too, but it's also possible that it got somehow fixed on |
It does segfault for me in the issue you linked above on develop and not on this branch when, e.g., invoking Invoking the same with |
Yep, exactly! It's because of the debug comments. I ran into this when I was working on the debug attribute parsing stuff. I would say we should merge this soon so I can rework my PR.
In general it is a bug in |
…ed: Whitespace between the indices as well as single-quoted code snippets are now allowed.
I think we still need some adjustments in the scanner and a few more tests. Other than that it looks pretty good now.
if (c == '\\')
    scanEscape();
If I understand this code correctly, this does not quite match the behavior of the original regex. The code here will just skip invalid escape sequences and a stray `\` at the end of input. If we want to match the original regex, we'd have to instead:
- not interpret unrecognized escapes, just keep them as is,
- report an unterminated string if we reach `\` at the end of input (well, I guess reporting an illegal escape instead would be fine too, as long as it's still an error).
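For illustration, the two points above can be sketched in a standalone function. This is a minimal sketch, not the Solidity scanner: `scanQuotedSnippet` is a hypothetical name, and the set of recognized escapes is an assumption chosen only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not part of the Solidity code base).
// `input` must start with the opening quote (single or double).
// Returns the literal contents, or std::nullopt if the literal is unterminated,
// which includes a stray backslash at the very end of the input.
std::optional<std::string> scanQuotedSnippet(std::string_view input)
{
    assert(!input.empty() && (input[0] == '"' || input[0] == '\''));
    char const quote = input[0];
    std::string literal;
    for (std::size_t i = 1; i < input.size(); ++i)
    {
        char const c = input[i];
        if (c == quote)
            return literal; // closing quote found; the rest is the tail
        if (c != '\\')
        {
            literal += c;
            continue;
        }
        if (i + 1 >= input.size())
            return std::nullopt; // backslash at end of input: still an error
        char const escaped = input[++i];
        switch (escaped)
        {
        case 'n': literal += '\n'; break;
        case 't': literal += '\t'; break;
        case '\\': case '"': case '\'': literal += escaped; break;
        default:
            // Unrecognized escape: keep it as is, backslash included.
            literal += '\\';
            literal += escaped;
        }
    }
    return std::nullopt; // ran out of input before the closing quote
}
```

With this shape, an input such as `'a\qb'` yields the literal `a\qb` unchanged, while `"abc\` is reported as unterminated instead of being silently accepted.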
Yeah it's true, it wasn't reflecting the RegEx behavior - although snippets are really just skipped during parsing, the important bit is the correct detection of the tail. That being said, it absolutely makes sense to interpret the 'special comment' literal string sensibly. Now everything that is escaped and interpretable is being interpreted as such and the rest is appended verbatim. I have added a couple tests to the scanner for that.
By the way, the most sensible behavior for such escapes would actually be to report an error, but that's breaking. They should never happen in Yul coming from the codegen and are only possible in user-supplied Yul, so maybe it's acceptable, but I wouldn't do it in this PR. Maybe this is something that we should consider later though. Really, neither skipping them, nor keeping them "as is" is ideal.
Ah, right. I focused on parsing the long hex string (which should not be affected by this PR) but overlooked the fact that such a string will also end up in a snippet as a part of debug info. In that case, yeah, this PR does fix it.
This generally looks good so I'm approving already.
I still have some final suggestions, mostly regarding wording and comments. The only bigger one would be to change how unterminated escapes are handled, though that's not incorrect, it just leaves some unexpected artifacts in the (invalid) literal.
@@ -3,6 +3,7 @@
Language Features:
* Accept declarations of state variables with ``transient`` data location (parser support only, no code generation yet).
* Make ``require(bool, Error)`` available when using the legacy pipeline.
* Yul: Parsing rules for source location comments have been relaxed: Whitespace between the indices as well as single-quoted code snippets are now allowed.
- * Yul: Parsing rules for source location comments have been relaxed: Whitespace between the indices as well as single-quoted code snippets are now allowed.
+ * Yul: Parsing rules for source location comments have been relaxed: Whitespace between the location components as well as single-quoted code snippets are now allowed.
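To make "whitespace between the location components" concrete, here is a rough sketch of a tolerant parser for the `<sourceIndex>:<start>:<end>` triple of a source location. The names `SourceLocation` and `parseSrcLocation` are hypothetical, and exactly where whitespace is accepted is an assumption for the example rather than a statement of what the PR implements. Integer overflow is ignored for brevity.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical result type for the example.
struct SourceLocation { int sourceIndex; int start; int end; };

static void skipWhitespace(std::string_view s, std::size_t& pos)
{
    while (pos < s.size() && (s[pos] == ' ' || s[pos] == '\t'))
        ++pos;
}

// Parses an optionally negative decimal integer at `pos`.
static std::optional<int> parseInt(std::string_view s, std::size_t& pos)
{
    std::size_t const begin = pos;
    bool negative = false;
    if (pos < s.size() && s[pos] == '-')
    {
        negative = true;
        ++pos;
    }
    int value = 0;
    std::size_t digits = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
    {
        value = value * 10 + (s[pos] - '0');
        ++pos;
        ++digits;
    }
    if (digits == 0)
    {
        pos = begin; // nothing consumed on failure
        return std::nullopt;
    }
    return negative ? -value : value;
}

// Accepts "0:123:456" as well as "0 : 123 :456" (assumed relaxation).
std::optional<SourceLocation> parseSrcLocation(std::string_view text)
{
    std::size_t pos = 0;
    int parts[3] = {0, 0, 0};
    for (int i = 0; i < 3; ++i)
    {
        skipWhitespace(text, pos);
        if (i > 0)
        {
            if (pos >= text.size() || text[pos] != ':')
                return std::nullopt;
            ++pos;
            skipWhitespace(text, pos);
        }
        std::optional<int> value = parseInt(text, pos);
        if (!value)
            return std::nullopt;
        parts[i] = *value;
    }
    return SourceLocation{parts[0], parts[1], parts[2]};
}
```

A hand-written scanner like this has no recursion, which is one reason it can cope with very long inputs where a backtracking regex may not.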
@@ -13,6 +14,7 @@ Compiler Features:

Bugfixes:
* AsmParser: Alleviates risk of encountering a segfault for very long comments.
- * AsmParser: Alleviates risk of encountering a segfault for very long comments.
+ * Yul Parser: Fix segfault when parsing very long location comments.
// the second source location is not parsed as such, as the hex string isn't interpreted as snippet but
// as the beginning of the tail in AsmParser
Looks like this comment was duplicated, even in tests that do not include the second source location at all :) It should probably have stayed only in the one test where it was relevant?
@@ -764,7 +764,7 @@ bool Scanner::scanEscape()
char c = m_char;

// Skip escaped newlines.
Outdated comment.
- // Skip escaped newlines.
+ // Normally we ignore the slash just before a newline since it's meaningless.
+ // In the "special comment" mode, we preserve it though, like all invalid escapes.
Also, one way to make this code more self-explanatory would be to add a meaningfully named parameter (e.g. `_preserveInvalidEscapes` or `_rejectInvalidEscapes` with the opposite meaning) instead of special-casing the `SpecialComment` mode here. It would make it clearer here why we're doing something different in this mode.
Then, at the point of call that would also make it more obvious that the function will behave differently:
if (m_kind == ScannerKind::SpecialComment)
{
if (c == '\\')
scanEscape(true /* _preserveInvalidEscapes */);
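A simplified, free-standing version of that idea might look like the sketch below. It is not the real `Scanner::scanEscape()`: the signature, the explicit `input`/`pos`/`literal` arguments and the small set of recognized escapes are assumptions made so the example compiles on its own. Only the parameter name is taken from the suggestion above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Simplified stand-in for a scanner's escape handling.
// `pos` must point at the character right after the backslash.
// With preserveInvalidEscapes == false, an unrecognized escape is rejected.
// With preserveInvalidEscapes == true, it is appended verbatim (backslash included).
bool scanEscape(
    std::string_view input,
    std::size_t& pos,
    std::string& literal,
    bool preserveInvalidEscapes = false
)
{
    assert(pos < input.size());
    char const c = input[pos++];
    switch (c)
    {
    case 'n': literal += '\n'; return true;
    case 't': literal += '\t'; return true;
    case '\\': case '"': case '\'': literal += c; return true;
    default:
        if (!preserveInvalidEscapes)
            return false; // regular mode: invalid escape is an error
        literal += '\\';
        literal += c;
        return true;
    }
}
```

The behavior switch is then visible at the call site through the argument, rather than hidden in a check on the scanner mode inside the function.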
if (c == '\\')
    scanEscape();
It would be more consistent with the other branch to treat an unterminated `\` as `IllegalEscapeSequence`. Currently you continue, run `scanEscape()` and then error with `IllegalStringEndQuote` right after the loop.
It does work, but it's still weird to let `scanEscape()` run when we're at the end of input and `m_char` is `0`, which I guess is assumed to never actually be read. The function will actually take that `0` and put it into the literal, and that literal is still accessible via `currentLiteral()` even when we error out.
Also, you should either handle the return value from `scanEscape()` or assert that it's not supposed to fail:
- if (c == '\\')
-     scanEscape();
+ if (c == '\\')
+ {
+     if (isSourcePastEndOfInput())
+         return setError(ScannerError::IllegalEscapeSequence);
+     bool validEscape = scanEscape();
+     solAssert(validEscape);
+ }
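The effect of the end-of-input check can be reproduced in a toy scanner. The sketch below borrows the names `isSourcePastEndOfInput`, `scanEscape`, `IllegalEscapeSequence` and `IllegalStringEndQuote` from the suggestion, but `ToyScanner` itself is hypothetical and much simpler than the real class; a plain `assert` stands in for `solAssert`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

enum class ScannerError { NoError, IllegalEscapeSequence, IllegalStringEndQuote };

// Toy model of the "special comment" literal loop; not the Solidity Scanner.
struct ToyScanner
{
    std::string_view source; // text right after the opening double quote
    std::size_t pos = 0;
    std::string literal;

    bool isSourcePastEndOfInput() const { return pos >= source.size(); }

    // In this toy, unknown escapes are preserved verbatim, so this never fails.
    bool scanEscape()
    {
        char const c = source[pos++];
        if (c == 'n')
            literal += '\n';
        else if (c == '\\' || c == '"')
            literal += c;
        else
        {
            literal += '\\';
            literal += c;
        }
        return true;
    }

    ScannerError scanSnippet()
    {
        while (!isSourcePastEndOfInput() && source[pos] != '"')
        {
            char const c = source[pos++];
            if (c == '\\')
            {
                // Check before scanEscape() so it never reads past the end
                // and nothing bogus ends up in the literal.
                if (isSourcePastEndOfInput())
                    return ScannerError::IllegalEscapeSequence;
                bool validEscape = scanEscape();
                assert(validEscape);
                (void) validEscape; // silence unused warning when asserts are off
            }
            else
                literal += c;
        }
        if (isSourcePastEndOfInput())
            return ScannerError::IllegalStringEndQuote;
        ++pos; // consume the closing quote
        return ScannerError::NoError;
    }
};
```

For the input `abc\` this returns `IllegalEscapeSequence` with the literal still equal to `abc`, whereas an input that merely lacks the closing quote returns `IllegalStringEndQuote`.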
Fixes #15207
Fixes #13496