AsmParser: parse source comment using scanner instead of regex #15209

clonker · 2024-06-20T11:58:12Z

Fixes #15207
Fixes #13496

matheusaaguiar

Just some nitpicks and small suggestions.

libyul/AsmParser.cpp

matheusaaguiar · 2024-06-20T18:16:22Z

Also, just to confirm/point to, I see that we have some tests in test/libyul/Parser.cpp which seem to cover parsing. Not sure if we need or would benefit from adding something more there.

cameel

This needs a changelog entry for the bugfix.

Also note that GitHub is a bit picky about the fixes/closes annotation. "fixes issue #15207" did not work: the PR is not connected to the issue and won't close it automatically. You need "Fixes #15207", i.e. without anything between the keyword and the issue number.

> Also, just to confirm/point to, I see that we have some tests in test/libyul/Parser.cpp which seem to cover parsing. Not sure if we need or would benefit from adding something more there.

The repro is a bit large and otherwise trivial, so I think it's ok to skip it.

I think we're missing some coverage for whitespace and newline handling behavior, because the PR passes all tests even though the way it handles whitespace seems to have changed.

libyul/AsmParser.cpp

Changelog.md

liblangutil/Scanner.cpp

libyul/AsmParser.cpp

liblangutil/Scanner.h

aarlt · 2024-06-25T16:01:08Z

I guess you could now also remove the rm ./*_bytecode_too_large_*.sol ./*_combined_too_large_*.sol in .circleci/parallel_bytecode_report.sh - if I remember correctly, the crashes were mainly happening because of the regexp stuff.

cameel · 2024-06-26T17:46:46Z

> I guess you could now also remove the rm ./*_bytecode_too_large_*.sol ./*_combined_too_large_*.sol in .circleci/parallel_bytecode_report.sh - if I remember correctly, the crashes were mainly happening because of the regexp stuff.

Looks like it worked. Which is actually odd, because this PR doesn't fix #13496, does it? I mean, if it does then great, maybe we should close it too, but it's also possible that it got somehow fixed on develop and we didn't realize :)

clonker · 2024-06-27T09:21:35Z

> I guess you could now also remove the rm ./*_bytecode_too_large_*.sol ./*_combined_too_large_*.sol in .circleci/parallel_bytecode_report.sh - if I remember correctly, the crashes were mainly happening because of the regexp stuff.

> Looks like it worked. Which is actually odd, because this PR doesn't fix #13496, does it? I mean, if it does then great, maybe we should close it too, but it's also possible that it got somehow fixed on develop and we didn't realize :)

It does segfault for me in the issue you linked above on develop and not on this branch when, e.g., invoking solc --via-ir --asm test.sol. I assume it is because of the debug comments:

Invoking the same with --debug-info none does not segfault on develop, either.

aarlt · 2024-06-27T11:57:45Z

> I guess you could now also remove the rm ./*_bytecode_too_large_*.sol ./*_combined_too_large_*.sol in .circleci/parallel_bytecode_report.sh - if I remember correctly, the crashes were mainly happening because of the regexp stuff.

> Looks like it worked. Which is actually odd, because this PR doesn't fix #13496, does it? I mean, if it does then great, maybe we should close it too, but it's also possible that it got somehow fixed on develop and we didn't realize :)

> It does segfault for me in the issue you linked above on develop and not on this branch when, e.g., invoking solc --via-ir --asm test.sol. I assume it is because of the debug comments:

> Invoking the same with --debug-info none does not segfault on develop, either.

Yep, exactly! It's because of the debug comments. I ran into this when I was working on the debug attribute parsing stuff. I would say we should merge this soon so I can rework my PR.

aarlt · 2024-06-27T12:00:20Z

In general, it is a bug in std::regex in some GCC versions that caused that segfault.

Changelog.md

…ed: Whitespace between the indices as well as single-quoted code snippets are now allowed.

cameel

I think we still need some adjustments in the scanner and a few more tests. Other than that it looks pretty good now.

cameel · 2024-06-27T16:38:50Z

liblangutil/Scanner.cpp

+			if (c == '\\')
+				scanEscape();


If I understand this code correctly, this does not quite match the behavior of the original regex. The code here will just skip invalid escape sequences and a stray \ at the end of input. If we want to match the original regex, we'd instead have to:

- not interpret unrecognized escapes, just keep them as is,
- report an unterminated string if we reach \ at the end of input (well, I guess reporting an illegal escape instead would be fine too, as long as it's still an error).

Yeah it's true, it wasn't reflecting the RegEx behavior - although snippets are really just skipped during parsing, the important bit is the correct detection of the tail. That being said, it absolutely makes sense to interpret the 'special comment' literal string sensibly. Now everything that is escaped and interpretable is being interpreted as such and the rest is appended verbatim. I have added a couple tests to the scanner for that.

By the way, the most sensible behavior for such escapes would actually be to report an error, but that's breaking. They should never happen in Yul coming from the codegen and are only possible in user-supplied Yul, so maybe it's acceptable, but I wouldn't do it in this PR. Maybe this is something that we should consider later though. Really, neither skipping them nor keeping them "as is" is ideal.
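For readers following this thread, here is a minimal standalone sketch of the escape handling being discussed. It is not the actual Scanner code: the names scanQuoted, ScanResult and ScanError, the flag _preserveInvalidEscapes, and the small set of recognized escapes are all assumptions made for illustration. Recognized escapes are interpreted, unrecognized ones are either rejected or appended verbatim depending on the flag, and a backslash at the end of input is always an error.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical error codes, loosely modelled on the names used in this thread.
enum class ScanError { None, IllegalEscapeSequence, IllegalStringEndQuote };

struct ScanResult
{
	ScanError error = ScanError::None;
	std::string literal;   // contents of the snippet with escapes processed
	std::size_t end = 0;   // index just past the closing quote (only set on success)
};

// Scans a quoted snippet. _input[_start] must exist and be the opening quote (' or ").
ScanResult scanQuoted(std::string_view _input, std::size_t _start, bool _preserveInvalidEscapes)
{
	ScanResult result;
	char const quote = _input[_start];
	std::size_t i = _start + 1;
	while (i < _input.size() && _input[i] != quote)
	{
		char const c = _input[i++];
		if (c != '\\')
		{
			result.literal += c;
			continue;
		}
		// A backslash with nothing after it can never form a valid escape.
		if (i >= _input.size())
		{
			result.error = ScanError::IllegalEscapeSequence;
			return result;
		}
		char const escaped = _input[i++];
		switch (escaped)
		{
		case 'n': result.literal += '\n'; break;
		case 't': result.literal += '\t'; break;
		case '\\': case '"': case '\'': result.literal += escaped; break;
		default:
			if (!_preserveInvalidEscapes)
			{
				result.error = ScanError::IllegalEscapeSequence;
				return result;
			}
			// Unrecognized escape: keep the backslash and the character as is.
			result.literal += '\\';
			result.literal += escaped;
		}
	}
	// The loop ended without seeing the closing quote.
	if (i >= _input.size())
	{
		result.error = ScanError::IllegalStringEndQuote;
		return result;
	}
	result.end = i + 1;
	return result;
}
```

The loop is a single forward pass, so the length of the snippet does not affect stack depth. In this sketch the tail of the comment would start at the returned end index.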

libyul/AsmParser.cpp

test/libyul/Parser.cpp

cameel · 2024-06-27T19:02:26Z

> It does segfault for me in the issue you linked above on develop and not on this branch when, e.g., invoking solc --via-ir --asm test.sol. I assume it is because of the debug comments:

Ah, right. I focused on parsing the long hex string (which should not be affected by this PR) but overlooked the fact that such a string will also end up in a snippet as a part of debug info. In that case, yeah, this PR does fix it.

cameel

This generally looks good so I'm approving already.

I still have some final suggestions, mostly regarding wording and comments. The only bigger one would be to change how unterminated escapes are handled, though the current handling is not incorrect, it just leaves some unexpected artifacts in the (invalid) literal.

cameel · 2024-06-28T15:04:42Z

Changelog.md

@@ -3,6 +3,7 @@
 Language Features:
 * Accept declarations of state variables with ``transient`` data location (parser support only, no code generation yet).
 * Make ``require(bool, Error)`` available when using the legacy pipeline.
+ * Yul: Parsing rules for source location comments have been relaxed: Whitespace between the indices as well as single-quoted code snippets are now allowed.


Suggested change

- * Yul: Parsing rules for source location comments have been relaxed: Whitespace between the indices as well as single-quoted code snippets are now allowed.
+ * Yul: Parsing rules for source location comments have been relaxed: Whitespace between the location components as well as single-quoted code snippets are now allowed.
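To illustrate what "whitespace between the location components" could mean in practice, here is a small hand-written parser for the three colon-separated numbers of a source location. This is only a sketch under assumptions: the names Location, parseLocation, readInteger and skipWhitespace are hypothetical, the exact whitespace and sign rules of the real parser may differ, and anything after the third number is simply ignored here.

```cpp
#include <cctype>
#include <cstddef>
#include <limits>
#include <optional>
#include <string_view>

// Hypothetical container for the three components of a source location.
struct Location { int sourceIndex; int start; int end; };

namespace
{

void skipWhitespace(std::string_view _text, std::size_t& _pos)
{
	while (_pos < _text.size() && std::isspace(static_cast<unsigned char>(_text[_pos])))
		++_pos;
}

// Reads an optionally negative decimal integer, skipping leading whitespace.
// Assumption: any negative number is accepted, not only -1.
std::optional<int> readInteger(std::string_view _text, std::size_t& _pos)
{
	skipWhitespace(_text, _pos);
	bool const negative = _pos < _text.size() && _text[_pos] == '-';
	if (negative)
		++_pos;
	std::size_t const firstDigit = _pos;
	int value = 0;
	while (_pos < _text.size() && std::isdigit(static_cast<unsigned char>(_text[_pos])))
	{
		// Refuse values that would overflow int instead of invoking undefined behavior.
		if (value > (std::numeric_limits<int>::max() - 9) / 10)
			return std::nullopt;
		value = value * 10 + (_text[_pos++] - '0');
	}
	if (_pos == firstDigit)
		return std::nullopt;
	return negative ? -value : value;
}

}

// Parses "<sourceIndex>:<start>:<end>", tolerating whitespace around each component.
std::optional<Location> parseLocation(std::string_view _text)
{
	std::size_t pos = 0;
	int parts[3];
	for (int i = 0; i < 3; ++i)
	{
		if (i > 0)
		{
			skipWhitespace(_text, pos);
			if (pos >= _text.size() || _text[pos] != ':')
				return std::nullopt;
			++pos;
		}
		std::optional<int> const value = readInteger(_text, pos);
		if (!value)
			return std::nullopt;
		parts[i] = *value;
	}
	return Location{parts[0], parts[1], parts[2]};
}
```

With this sketch, both "0:10:20" and "1 : 2 :3" parse successfully, while "1:2" and "a:1:2" are rejected.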

cameel · 2024-06-28T15:07:44Z

Changelog.md

@@ -13,6 +14,7 @@ Compiler Features:


 Bugfixes:
+ * AsmParser: Alleviates risk of encountering a segfault for very long comments.


Suggested change

- * AsmParser: Alleviates risk of encountering a segfault for very long comments.
+ * Yul Parser: Fix segfault when parsing very long location comments.

cameel · 2024-06-28T15:22:10Z

test/libyul/Parser.cpp

+	// the second source location is not parsed as such, as the hex string isn't interpreted as snippet but
+	// as the beginning of the tail in AsmParser


Looks like this comment was duplicated, even in tests that do not include the second source location at all :) It should probably have stayed only in the one test where it was relevant?

cameel · 2024-06-28T15:41:48Z

liblangutil/Scanner.cpp

@@ -764,7 +764,7 @@ bool Scanner::scanEscape()
 	char c = m_char;

 	// Skip escaped newlines.


Outdated comment.

Suggested change

-	// Skip escaped newlines.
+	// Normally we ignore the slash just before a newline since it's meaningless.
+	// In the "special comment" mode, we preserve it though, like all invalid escapes.

Also, one way to make this code more self-explanatory would be to add a meaningfully named parameter (e.g. _preserveInvalidEscapes or _rejectInvalidEscapes with the opposite meaning) instead of special-casing the SpecialComment mode here. It would make it clearer here why we're doing something different in this mode.

Then, at the point of call that would also make it more obvious that the function will behave differently:

	if (m_kind == ScannerKind::SpecialComment)
	{
		if (c == '\\')
			scanEscape(true /* _preserveInvalidEscapes */);

cameel · 2024-06-28T16:12:42Z

liblangutil/Scanner.cpp

+			if (c == '\\')
+				scanEscape();


It would be more consistent with the other branch to treat unterminated \ as IllegalEscapeSequence. Currently you continue, run scanEscape() and then error with IllegalStringEndQuote right after the loop.

It does work, but it's still weird to let scanEscape() run when we're at the end of input and m_char is 0, which I guess is assumed to never actually be read. The function will actually take that 0 and put into the literal and that literal is still accessible via currentLiteral() even when we error out.

Also, you should either handle the return value from scanEscape() or assert that it's not supposed to fail:

Suggested change

-			if (c == '\\')
-				scanEscape();
+			if (c == '\\')
+			{
+				if (isSourcePastEndOfInput())
+					return setError(ScannerError::IllegalEscapeSequence);
+				bool validEscape = scanEscape();
+				solAssert(validEscape);
+			}
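The effect of the missing end-of-input check can be reproduced with a toy character stream that, like the scanner described above, yields 0 once the input is exhausted. The class below is a hypothetical stand-in and not the real Scanner. MiniScanner, scanBackslash and the _guardEndOfInput flag are names invented for this sketch.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Toy stand-in for a character stream with a 0 sentinel at end of input.
class MiniScanner
{
public:
	explicit MiniScanner(std::string _source): m_source(std::move(_source)) { m_char = current(); }

	bool isPastEndOfInput() const { return m_position >= m_source.size(); }
	std::string const& literal() const { return m_literal; }

	// Expects the current character to be a backslash and consumes the escape verbatim.
	// Returns false on error. Without the guard, the 0 sentinel is appended to the
	// literal like any other character when the backslash is the last input character.
	bool scanBackslash(bool _guardEndOfInput)
	{
		advance(); // skip the backslash itself
		if (_guardEndOfInput && isPastEndOfInput())
			return false;
		m_literal += '\\';
		m_literal += m_char;
		advance();
		return true;
	}

private:
	char current() const { return isPastEndOfInput() ? 0 : m_source[m_position]; }
	void advance()
	{
		if (!isPastEndOfInput())
			++m_position;
		m_char = current();
	}

	std::string m_source;
	std::size_t m_position = 0;
	char m_char = 0;
	std::string m_literal;
};
```

For an input consisting of a single backslash, the unguarded call reports success and leaves a two-character literal whose second character is the 0 sentinel. The guarded call fails and leaves the literal empty.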

clonker force-pushed the asm_parser_use_scanner branch from ff37008 to f3b731e · June 20, 2024 12:41

clonker marked this pull request as ready for review · June 20, 2024 13:04

matheusaaguiar reviewed · Jun 20, 2024

cameel added refactor and removed refactor labels · Jun 20, 2024

cameel reviewed · Jun 21, 2024

nikola-matic reviewed · Jun 21, 2024

clonker force-pushed the asm_parser_use_scanner branch 2 times, most recently from 7cf8e50 to b3fd10e · June 21, 2024 16:21

cameel reviewed · Jun 21, 2024

clonker force-pushed the asm_parser_use_scanner branch from b3fd10e to 5bca035 · June 24, 2024 08:21

clonker mentioned this pull request in "Numerical Yul node id handles" #15215 (draft) · Jun 24, 2024

clonker force-pushed the asm_parser_use_scanner branch from 5bca035 to a1fb95f · June 26, 2024 14:06

aarlt reviewed · Jun 27, 2024

clonker added 2 commits · June 27, 2024 14:24

9f43e34: AsmParser: Parsing rules for source location comments have been relaxed: Whitespace between the indices as well as single-quoted code snippets are now allowed.

64ab6be: re-enable bytecode_too_large tests

clonker force-pushed the asm_parser_use_scanner branch from a1fb95f to 64ab6be · June 27, 2024 12:25

cameel reviewed · Jun 27, 2024

clonker added 3 commits · June 28, 2024 07:57

17774c1: f

54d976c: Scanner: special comments append invalid escapes verbatim

a0cea0c: invalid scanner tokens beyond unterminated quotes are ignored

cameel approved these changes · Jun 28, 2024
