-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace ANTLR4 with homegrown LALR parser generator #38
Conversation
This should be the last time I replace the parser. Besides the change in the grammar, lexer has been rewritten (again) to be stateless. This has been accomplished using "lexer modes", which are a thing I saw mentioned in an Oil Shell blog post, and then came up with my own implementation. Also, the old "AstBuilder" class, which turned the ANTLR4 parse tree into an AST, has been replaced with a "Simplify" pass, which turns the output of the parser into the same AST format. One nice side-effect of this is that, where if the AstBuilder didn't understand something, it was left out of the tree entirely, the Simplify pass will pass it through unchanged. Of course, we lost the dependency on ANTLR4 -- we no longer require the runtime, nor is there a separate build step after modifying the grammar.
Pull Request Test Coverage Report for Build 259
💛 - Coveralls |
In particular, having to split up the start and end rules made the resulting syntax tree awkward to work with. Also, the parsing module now features documentation explaining the particulars of why the mode-selection system works the way it does.
Required rewriting the simplify pass to use the new pattern syntax
- grammar: strings now allowed in expressions - simplify-pass: T.PUNCT_COMMA tokens removed from alists - simplify-pass: malformed integers raise ParseError instead of ValueError - pattern: P.node predicate no longer throws an exception instead of simply failing when trying to match non-node items.
A symbol can now be replaced with a tuple of symbols on the right-hand side of a rule, indicating that either is acceptable in that position. This has two effects: - makes the grammar less repetitive, especially for operators. - makes the generated parser slightly more efficient, as some states can now be merged. This has been used to simplify the main grammar. Also, BITNOT should have been unary; this has been corrected.
which currently open PRs are obsoleted by this PR? |
None of them, more or less. Of the four other PRs, #27 is a documentation-only change, and #40 is an automated dependency-update change, so neither of those are obsoleted. #34 is mostly new tests right now, so all it'll require is merging master back in, resolving a couple of minor conflicts, and some tweaking around how expression syntax is tested. #37 isn't obsoleted, but due to a combination of ill-considered refactoring activity that should have been a separate PR, some grammar changes on the old parser, and some small bits of code that are actually obsoleted, the merge conflict might be so hairy that it'd be faster for me to create one or more new branches from commits cherry-picked off of it. (Also, there's like one more changeset that I have ready for this branch which I haven't pushed to it yet, which clears up some of the awkwardness around how symbols are handled. If you give me a couple of minutes I'll push it.) |
This clears up some of the awkwardness around writing functions that handle both kinds of symbol by providing common interfaces to both of them. Additionally, it means that extended symbols are no longer handled through somewhat opaque tuple-munging, hopefully making the code more readable.
This should be the last time I replace the parser.
Besides the change in the grammar, lexer has been rewritten (again) to be stateless. This has been accomplished using "lexer modes", which are a thing I saw mentioned in an Oil Shell blog post, and then came up with my own implementation.
Also, the old "AstBuilder" class, which turned the ANTLR4 parse tree into an AST, has been replaced with a "Simplify" pass, which turns the output of the parser into the same AST format. One nice side-effect of this is that, where if the AstBuilder didn't understand something, it was left out of the tree entirely, the Simplify pass will pass it through unchanged.
Of course, we lost the dependency on ANTLR4 -- we no longer require the runtime, nor is there a separate build step after modifying the grammar.
EDIT 2018-08-06 - Supplementary summary of overall changes relative to master:
The
jeff65.gold.compiler.parse
function should produce the same AST as before for supported syntax. For unsupported syntax, it just includes the concrete syntax tree verbatim, instead of omitting the element, which is what was happening before. Downstream code generation is therefore unaffected. The other big change is that theParseError
exception moved fromjeff65.gold
tojeff65.parsing
, which accounts for a lot of the changes intest_ast.py
which weren't new tests or tests brought back from the top-down parser days.The exact set of rules used to write the grammar (formerly in
jeff65/gold/grammar/Grammar.g4
, now injeff65/gold/grammar.py
) is slightly different, mostly to compensate for the fact that ANTLR4 grammars allow for fairly sophisticated constructs (e.g. repetition, complex alternation, optional tokens) to be written out directly, whilejeff65.parsing
has to have them spelled out in primitives. Also,jeff65.parsing
has to have rule precedence assigned explicitly, while ANTLR4 infers it based on rule order.The lexer/parser interface is completely different. Lexer code that I found myself copy-pasting in [WIP] Handle blocks of assembly/il code. #37 got factored out into the new
jeff65.parsing.ReStream
class. Additionally, we now have a lexer generator.Comments and strings are no longer handled by the lexer being stateful--strings are now handled as part of the main grammar using lexer modes, while comments are handled using a combination of lexer modes, lexer channels, and a helper parser. Basically, the lexer runs into the start-comment token, outputs it on the
HIDDEN
channel, which triggers the main parser delegating to the helper parser for that channel. The helper parser has rules which put the lexer in comment mode, which causes the lexer to stop trying to recognize keywords and stuff. Eventually it decides that the comment is over, causing the helper parser to return control back to the main parser, which then continues to use the lexer in normal mode. This probably sounds slightly more awkward than it is.The
jeff65.gold.passes.simplify.Simplify
pass, which replacesjeff65.gold.ast.AstBuilder
, uses the pattern-match-and-replace system to do it's job, instead of an explicit tree-walk. This is hopefully easier to understand and modify.Primary files of interest are
jeff65/parsing.py
,jeff65/gold/grammar.py
,jeff65/gold/passes/simplify.py
, and possiblytests/test_ast.py
.