Replace ANTLR4 with homegrown LALR parser generator #38

jdpage · 2018-08-02T07:37:19Z

This should be the last time I replace the parser.

Besides the change in the grammar, lexer has been rewritten (again) to be stateless. This has been accomplished using "lexer modes", which are a thing I saw mentioned in an Oil Shell blog post, and then came up with my own implementation.

Also, the old "AstBuilder" class, which turned the ANTLR4 parse tree into an AST, has been replaced with a "Simplify" pass, which turns the output of the parser into the same AST format. One nice side-effect of this is that, where if the AstBuilder didn't understand something, it was left out of the tree entirely, the Simplify pass will pass it through unchanged.

Of course, we lost the dependency on ANTLR4 -- we no longer require the runtime, nor is there a separate build step after modifying the grammar.

EDIT 2018-08-06 - Supplementary summary of overall changes relative to master:

The jeff65.gold.compiler.parse function should produce the same AST as before for supported syntax. For unsupported syntax, it just includes the concrete syntax tree verbatim, instead of omitting the element, which is what was happening before. Downstream code generation is therefore unaffected. The other big change is that the ParseError exception moved from jeff65.gold to jeff65.parsing, which accounts for a lot of the changes in test_ast.py which weren't new tests or tests brought back from the top-down parser days.
The exact set of rules used to write the grammar (formerly in jeff65/gold/grammar/Grammar.g4, now in jeff65/gold/grammar.py) is slightly different, mostly to compensate for the fact that ANTLR4 grammars allow for fairly sophisticated constructs (e.g. repetition, complex alternation, optional tokens) to be written out directly, while jeff65.parsing has to have them spelled out in primitives. Also, jeff65.parsing has to have rule precedence assigned explicitly, while ANTLR4 infers it based on rule order.
The lexer/parser interface is completely different. Lexer code that I found myself copy-pasting in [WIP] Handle blocks of assembly/il code. #37 got factored out into the new jeff65.parsing.ReStream class. Additionally, we now have a lexer generator.
Comments and strings are no longer handled by the lexer being stateful--strings are now handled as part of the main grammar using lexer modes, while comments are handled using a combination of lexer modes, lexer channels, and a helper parser. Basically, the lexer runs into the start-comment token, outputs it on the HIDDEN channel, which triggers the main parser delegating to the helper parser for that channel. The helper parser has rules which put the lexer in comment mode, which causes the lexer to stop trying to recognize keywords and stuff. Eventually it decides that the comment is over, causing the helper parser to return control back to the main parser, which then continues to use the lexer in normal mode. This probably sounds slightly more awkward than it is.
The jeff65.gold.passes.simplify.Simplify pass, which replaces jeff65.gold.ast.AstBuilder, uses the pattern-match-and-replace system to do it's job, instead of an explicit tree-walk. This is hopefully easier to understand and modify.

Primary files of interest are jeff65/parsing.py, jeff65/gold/grammar.py, jeff65/gold/passes/simplify.py, and possibly tests/test_ast.py.

This should be the last time I replace the parser. Besides the change in the grammar, lexer has been rewritten (again) to be stateless. This has been accomplished using "lexer modes", which are a thing I saw mentioned in an Oil Shell blog post, and then came up with my own implementation. Also, the old "AstBuilder" class, which turned the ANTLR4 parse tree into an AST, has been replaced with a "Simplify" pass, which turns the output of the parser into the same AST format. One nice side-effect of this is that, where if the AstBuilder didn't understand something, it was left out of the tree entirely, the Simplify pass will pass it through unchanged. Of course, we lost the dependency on ANTLR4 -- we no longer require the runtime, nor is there a separate build step after modifying the grammar.

coveralls · 2018-08-02T07:39:03Z

Pull Request Test Coverage Report for Build 259

532 of 554 (96.03%) changed or added relevant lines in 6 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+4.002%) to 89.057%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
jeff65/gold/pattern.py	12	14	85.71%
jeff65/parsing.py	412	432	95.37%

Totals
Change from base Build 254:	4.002%
Covered Lines:	1440
Relevant Lines:	1610

💛 - Coveralls

In particular, having to split up the start and end rules made the resulting syntax tree awkward to work with. Also, the parsing module now features documentation explaining the particulars of why the mode-selection system works the way it does.

Required rewriting the simplify pass to use the new pattern syntax

- grammar: strings now allowed in expressions - simplify-pass: T.PUNCT_COMMA tokens removed from alists - simplify-pass: malformed integers raise ParseError instead of ValueError - pattern: P.node predicate no longer throws an exception instead of simply failing when trying to match non-node items.

A symbol can now be replaced with a tuple of symbols on the right-hand side of a rule, indicating that either is acceptable in that position. This has two effects: - makes the grammar less repetitive, especially for operators. - makes the generated parser slightly more efficient, as some states can now be merged. This has been used to simplify the main grammar. Also, BITNOT should have been unary; this has been corrected.

woodrowbarlow · 2018-08-06T13:56:59Z

which currently open PRs are obsoleted by this PR?

jdpage · 2018-08-06T21:27:38Z

None of them, more or less. Of the four other PRs, #27 is a documentation-only change, and #40 is an automated dependency-update change, so neither of those are obsoleted.

#34 is mostly new tests right now, so all it'll require is merging master back in, resolving a couple of minor conflicts, and some tweaking around how expression syntax is tested.

#37 isn't obsoleted, but due to a combination of ill-considered refactoring activity that should have been a separate PR, some grammar changes on the old parser, and some small bits of code that are actually obsoleted, the merge conflict might be so hairy that it'd be faster for me to create one or more new branches from commits cherry-picked off of it.

(Also, there's like one more changeset that I have ready for this branch which I haven't pushed to it yet, which clears up some of the awkwardness around how symbols are handled. If you give me a couple of minutes I'll push it.)

This clears up some of the awkwardness around writing functions that handle both kinds of symbol by providing common interfaces to both of them. Additionally, it means that extended symbols are no longer handled through somewhat opaque tuple-munging, hopefully making the code more readable.

jdpage requested a review from woodrowbarlow August 2, 2018 07:38

jdpage added 7 commits August 3, 2018 17:07

Merge branch 'master' into pgen

0e4ad31

Required rewriting the simplify pass to use the new pattern syntax

Update runtime dependencies in setup.py

f96aa49

Merge branch 'master' into pgen

4390964

Merge branch 'master' into pgen

a8e2367

jdpage added 2 commits August 6, 2018 17:41

Merge branch 'master' into pgen

89dca37

jdpage added this to the Hello world! milestone Aug 6, 2018

jdpage added syntax internals labels Aug 6, 2018

Use named precedence levels instead of numeric ones

599f071

woodrowbarlow approved these changes Aug 8, 2018

View reviewed changes

jdpage merged commit 4036e41 into master Aug 8, 2018

jdpage deleted the pgen branch August 9, 2018 16:02

jdpage modified the milestones: Hello world!, Infrastructure Stability Aug 10, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace ANTLR4 with homegrown LALR parser generator #38

Replace ANTLR4 with homegrown LALR parser generator #38

jdpage commented Aug 2, 2018 •

edited

Loading

coveralls commented Aug 2, 2018 •

edited

Loading

woodrowbarlow commented Aug 6, 2018

jdpage commented Aug 6, 2018

Replace ANTLR4 with homegrown LALR parser generator #38

Replace ANTLR4 with homegrown LALR parser generator #38

Conversation

jdpage commented Aug 2, 2018 • edited Loading

coveralls commented Aug 2, 2018 • edited Loading

Pull Request Test Coverage Report for Build 259

💛 - Coveralls

woodrowbarlow commented Aug 6, 2018

jdpage commented Aug 6, 2018

jdpage commented Aug 2, 2018 •

edited

Loading

coveralls commented Aug 2, 2018 •

edited

Loading