Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace ANTLR4 with homegrown LALR parser generator #38

Merged
merged 11 commits into from
Aug 8, 2018
Merged

Replace ANTLR4 with homegrown LALR parser generator #38

merged 11 commits into from
Aug 8, 2018

Conversation

jdpage
Copy link
Collaborator

@jdpage jdpage commented Aug 2, 2018

This should be the last time I replace the parser.

Besides the change in the grammar, lexer has been rewritten (again) to be stateless. This has been accomplished using "lexer modes", which are a thing I saw mentioned in an Oil Shell blog post, and then came up with my own implementation.

Also, the old "AstBuilder" class, which turned the ANTLR4 parse tree into an AST, has been replaced with a "Simplify" pass, which turns the output of the parser into the same AST format. One nice side-effect of this is that, where if the AstBuilder didn't understand something, it was left out of the tree entirely, the Simplify pass will pass it through unchanged.

Of course, we lost the dependency on ANTLR4 -- we no longer require the runtime, nor is there a separate build step after modifying the grammar.

EDIT 2018-08-06 - Supplementary summary of overall changes relative to master:

  • The jeff65.gold.compiler.parse function should produce the same AST as before for supported syntax. For unsupported syntax, it just includes the concrete syntax tree verbatim, instead of omitting the element, which is what was happening before. Downstream code generation is therefore unaffected. The other big change is that the ParseError exception moved from jeff65.gold to jeff65.parsing, which accounts for a lot of the changes in test_ast.py which weren't new tests or tests brought back from the top-down parser days.

  • The exact set of rules used to write the grammar (formerly in jeff65/gold/grammar/Grammar.g4, now in jeff65/gold/grammar.py) is slightly different, mostly to compensate for the fact that ANTLR4 grammars allow for fairly sophisticated constructs (e.g. repetition, complex alternation, optional tokens) to be written out directly, while jeff65.parsing has to have them spelled out in primitives. Also, jeff65.parsing has to have rule precedence assigned explicitly, while ANTLR4 infers it based on rule order.

  • The lexer/parser interface is completely different. Lexer code that I found myself copy-pasting in [WIP] Handle blocks of assembly/il code. #37 got factored out into the new jeff65.parsing.ReStream class. Additionally, we now have a lexer generator.

  • Comments and strings are no longer handled by the lexer being stateful--strings are now handled as part of the main grammar using lexer modes, while comments are handled using a combination of lexer modes, lexer channels, and a helper parser. Basically, the lexer runs into the start-comment token, outputs it on the HIDDEN channel, which triggers the main parser delegating to the helper parser for that channel. The helper parser has rules which put the lexer in comment mode, which causes the lexer to stop trying to recognize keywords and stuff. Eventually it decides that the comment is over, causing the helper parser to return control back to the main parser, which then continues to use the lexer in normal mode. This probably sounds slightly more awkward than it is.

  • The jeff65.gold.passes.simplify.Simplify pass, which replaces jeff65.gold.ast.AstBuilder, uses the pattern-match-and-replace system to do it's job, instead of an explicit tree-walk. This is hopefully easier to understand and modify.

Primary files of interest are jeff65/parsing.py, jeff65/gold/grammar.py, jeff65/gold/passes/simplify.py, and possibly tests/test_ast.py.

This should be the last time I replace the parser.

Besides the change in the grammar, lexer has been rewritten (again) to
be stateless. This has been accomplished using "lexer modes", which are
a thing I saw mentioned in an Oil Shell blog post, and then came up with
my own implementation.

Also, the old "AstBuilder" class, which turned the ANTLR4 parse tree
into an AST, has been replaced with a "Simplify" pass, which turns the
output of the parser into the same AST format. One nice side-effect of
this is that, where if the AstBuilder didn't understand something, it
was left out of the tree entirely, the Simplify pass will pass it
through unchanged.

Of course, we lost the dependency on ANTLR4 -- we no longer require the
runtime, nor is there a separate build step after modifying the grammar.
@jdpage jdpage requested a review from woodrowbarlow August 2, 2018 07:38
@coveralls
Copy link

coveralls commented Aug 2, 2018

Pull Request Test Coverage Report for Build 259

  • 532 of 554 (96.03%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+4.002%) to 89.057%

Changes Missing Coverage Covered Lines Changed/Added Lines %
jeff65/gold/pattern.py 12 14 85.71%
jeff65/parsing.py 412 432 95.37%
Totals Coverage Status
Change from base Build 254: 4.002%
Covered Lines: 1440
Relevant Lines: 1610

💛 - Coveralls

jdpage added 7 commits August 3, 2018 17:07
In particular, having to split up the start and end rules made the
resulting syntax tree awkward to work with. Also, the parsing module now
features documentation explaining the particulars of why the
mode-selection system works the way it does.
Required rewriting the simplify pass to use the new pattern syntax
 - grammar: strings now allowed in expressions

 - simplify-pass: T.PUNCT_COMMA tokens removed from alists

 - simplify-pass: malformed integers raise ParseError instead of
   ValueError

 - pattern: P.node predicate no longer throws an exception instead of
   simply failing when trying to match non-node items.
A symbol can now be replaced with a tuple of symbols on the right-hand
side of a rule, indicating that either is acceptable in that position.
This has two effects:

 - makes the grammar less repetitive, especially for operators.

 - makes the generated parser slightly more efficient, as some states
   can now be merged.

This has been used to simplify the main grammar. Also, BITNOT should
have been unary; this has been corrected.
@woodrowbarlow
Copy link
Collaborator

which currently open PRs are obsoleted by this PR?

@jdpage
Copy link
Collaborator Author

jdpage commented Aug 6, 2018

None of them, more or less. Of the four other PRs, #27 is a documentation-only change, and #40 is an automated dependency-update change, so neither of those are obsoleted.

#34 is mostly new tests right now, so all it'll require is merging master back in, resolving a couple of minor conflicts, and some tweaking around how expression syntax is tested.

#37 isn't obsoleted, but due to a combination of ill-considered refactoring activity that should have been a separate PR, some grammar changes on the old parser, and some small bits of code that are actually obsoleted, the merge conflict might be so hairy that it'd be faster for me to create one or more new branches from commits cherry-picked off of it.

(Also, there's like one more changeset that I have ready for this branch which I haven't pushed to it yet, which clears up some of the awkwardness around how symbols are handled. If you give me a couple of minutes I'll push it.)

jdpage added 2 commits August 6, 2018 17:41
This clears up some of the awkwardness around writing functions that
handle both kinds of symbol by providing common interfaces to both of
them. Additionally, it means that extended symbols are no longer handled
through somewhat opaque tuple-munging, hopefully making the code more
readable.
@jdpage jdpage added this to the Hello world! milestone Aug 6, 2018
@jdpage jdpage merged commit 4036e41 into master Aug 8, 2018
@jdpage jdpage deleted the pgen branch August 9, 2018 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants