This repository has been archived by the owner on Oct 5, 2024. It is now read-only.

parser lexer #112 (Draft)

wants to merge 103 commits into master
Conversation

harrysarson
Contributor

Very light on details here; I will flesh out the description soon. In short, this is an attempt to replace our current parser with a two-stage lexer/parser (see this screencast, from which I have taken inspiration).

The first commit adds a lexer that can (maybe, probably not) parse any valid elm program. Moreover, it should never fail to parse anything, ever (see the Invalid LexItem into which I will stick anything that I cannot parse properly). I aim for the text => [ LexItems ] => text conversion to never be lossy.
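
(A sketch of how that losslessness could be checked with elm-test; Lexer.lex and Lexer.toString are assumed names here, not functions this PR defines.)

import Expect
import Lexer
import Test exposing (Test, test)

roundTrip : Test
roundTrip =
    test "text => [ LexItems ] => text is lossless" <|
        \() ->
            let
                source =
                    "module Foo exposing (..)\n\nx = 1"
            in
            Lexer.lex source
                |> Lexer.toString
                |> Expect.equal source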

@harrysarson
Contributor Author

The only really interesting thing here at this point in time (and the reason I am putting this PR out) is the LexItem data structure:

type LexItem
    = Sigil LexSigil
    | Token String
    | NumericLiteral String
    | TextLiteral LexLiteralType String
    | Whitespace Int
    | Newline Int
    | Comment LexCommentType String
    | Invalid String

type LexSigil
    = Bracket BracketType BracketTodoNameMe
    | Assign
    | Pipe
    | Comma
    | SingleDot
    | DoubleDot
    | ThinArrow
    | Backslash
    | Underscore
    | Colon
    | BinaryOperator LexBinaryOperator

type LexCommentType
    = LineComment
    | MultilineComment
    | DocComment

type LexBinaryOperator
    = Add
    | Subtract
    | Multiply
    | Divide
    | Exponentiate
    | And
    | Or
    | Equals
    | GreaterThan
    | GreaterThanEquals
    | LessThan
    | LessThanEquals
    | Append

type BracketType
    = Round
    | Square
    | Curly

type BracketTodoNameMe
    = Open
    | Close

type LexLiteralType
    = StringL StringType
    | CharL

type StringType
    = Single
    | Triple
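
(For a feel of the shape these types produce, here is a hand-written sketch, not actual lexer output, of how x = 1 + 2 might lex, with location info omitted:)

lexedExample : List LexItem
lexedExample =
    [ Token "x"
    , Whitespace 1
    , Sigil Assign
    , Whitespace 1
    , NumericLiteral "1"
    , Whitespace 1
    , Sigil (BinaryOperator Add)
    , Whitespace 1
    , NumericLiteral "2"
    ]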

Thoughts on these custom types would be very much appreciated!

@harrysarson
Contributor Author

In my experience with elm/parser, the most painful thing is understanding what the state of the parser is when it fails. The parser

  1. kindly spits out the location of the parser error
  2. gives some information about what it was trying to do when it failed
  3. but gives no information about its state when it fails.

If the lexer/parser approach is going to be superior to the elm/parser parser then it must do as well on (1) and (2) (which are the error information that users will see) but also help with (3) (this information is key to debugging or hacking around with the compiler).

I think the reason elm/parser parsers are poor at (3) is that they are a wrapper around an opaque function. Functions are by their nature impossible to compare and hard to visualise. Therefore, when I am designing the parser part of the lexer/parser I need to avoid the temptation to use function composition/lots of recursion. Instead I should try to craft some custom types which describe the state of the parser (what fraction of the AST it has produced so far) and compose these types. Then, if there is a problem I can dump this fraction of the AST, which should help with debugging.
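
(To illustrate what (3) could buy us: because the state is plain data, a failure report can simply dump it. A sketch, assuming the State and Error types outlined in the next comment; reportFailure is a hypothetical helper.)

reportFailure : State -> Error -> String
reportFailure state error =
    -- Both values are plain data, so Debug.toString can render them directly.
    "Parse failed with "
        ++ Debug.toString error
        ++ " while in state "
        ++ Debug.toString state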

The lexer has no idea whether an operator is binary or not! Let's
not pretend it does
@harrysarson
Contributor Author

I think this is how I want to structure the parser part of the lexer/parser. (An "entity" is a word I assigned today to anything in an elm file consisting of a line with no indentation followed by zero or more lines with indentation --- for example a custom type declaration, a type annotation, a module XX exposing (..) line, etc.)

type State
    = EntityStart
    | EntityFirstItem EntityFirstItem

type EntityFirstItem
    = EntityFirstItemType
    | EntityFirstItemModule
    | EntityFirstItemName String
    | ...

type Error
    = InvalidStartToEntity LexItem
    | MisplacedKeyword Keyword

parser : List LexItem -> Result Error Ast
parser items =
    parserHelp items EntityStart


parserHelp : List LexItem -> State -> Result Error Ast
parserHelp items state =
    case items of
        item :: rest ->
            let
                newState =
                    case state of
                        EntityStart -> parseEntityStart item
                        EntityFirstItem EntityFirstItemType -> parseTypeEntity item
                        EntityFirstItem EntityFirstItemModule -> parseModuleEntity item
                        EntityFirstItem (EntityFirstItemName _) -> parseValueDeclarationEntity item
            in
            case newState of
                Ok newState_ -> parserHelp rest newState_
                Err error -> Err error

        [] ->
            Ok (astFromState state)


parseEntityStart : LexItem -> Result Error State
parseEntityStart item =
    case item of
        Lexer.Token str ->
            case toKeyword str of
                Just TypeKeyword -> Ok (EntityFirstItem EntityFirstItemType)
                Just ModuleKeyword -> Ok (EntityFirstItem EntityFirstItemModule)

                Just other -> Err (MisplacedKeyword other)

                Nothing -> Ok (EntityFirstItem (EntityFirstItemName str))

        _ -> Err (InvalidStartToEntity item)
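
(toKeyword is left undefined above; a minimal sketch of the shape it might take, where ElseKeyword and the string table are purely illustrative:)

type Keyword
    = TypeKeyword
    | ModuleKeyword
    | ElseKeyword

toKeyword : String -> Maybe Keyword
toKeyword str =
    case str of
        "type" -> Just TypeKeyword
        "module" -> Just ModuleKeyword
        "else" -> Just ElseKeyword
        _ -> Nothing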

Comment on lines +65 to +67
, ( "works with nested module name"
, "module Foo.Bar exposing (..)"
)
Contributor

I would love these tests to not only check that there are no invalid items and that the round trip from/to String went OK, but also show what the lexed items are... for learning purposes.

Eg. if I saw

( "works with nested module name"
, "module Foo.Bar exposing (..)"
, [Token "module", ...] -- purposefully stripped of location info, to keep signal/noise ratio up
)

that would help me understand what the results of this lexing process look like.

Kinda like you did in the test just above this one :)

Contributor Author

Yeah, I went for the quickest way to make the Parser tests work for my Lexer (which was just to remove the bits that required any change in meaning).

To complete your example:

( "works with nested module name"
, "module Foo.Bar exposing (..)"
, [Token "module", Whitespace 1,  Token "Foo", Sigil FullStop, Token "Bar", Whitespace 1,  Token "exposing", Whitespace 1, Sigil (Bracket Round Open), Sigil DoubleDot, Sigil (Bracket Round Close) ]"
)

or something along these lines. I.e. very long! Unless I can find some way of automating it, I do not think I can face typing that out for all 100 or so tests in the LexerTests.elm file!

Contributor

Fair enough! The tests were previously written incrementally so the cost was ... amortized :) But right now it's not a huge priority, so I agree with you - let's not add those examples right now.

@Janiczek
Contributor

Janiczek commented Aug 6, 2020

Very interesting, thanks for prototyping that @harrysarson ! How do you feel about it so far? (Let's take this to the Discord if you want.)

I feel like we still have too little information to decide on whether to use this approach or not, since the other part that will take these LexItems and produce an AST still needs to be written (if I'm correct), but so far I see no problems/blockers that would make me want to scratch this idea 🙂 👍

@harrysarson
Contributor Author

I feel good! The progress I have made so far definitely counts more as exploration than implementation. It may be that we decide to scrap this entirely and I do not think that would be a bad outcome, because already I understand the problem scope a lot more than I did before. (For example, I am a big fan of these 3 aims for whatever solution we go for.)

No rush either, I definitely want to spend some time sitting on these ideas before committing myself to them.

@harrysarson
Contributor Author

I have run out of steam here for now. Closing in the hope that I one day soon cycle back to this and make more progress.

@Janiczek
Contributor

Thanks @harrysarson for the experiment! Yeah we can always circle back to it :)

We will have a lot of constructors and this namespacing will help tell
them apart.
run `elm repl` from ./tests/parser-tests to see live parsing!
and fix contextualisation so that the new cases do not cause crashes
we return the current state separately so this is just duplication
and fix crashes (and buggy parsing of bad syntax into a valid AST).
@harrysarson reopened this Sep 25, 2020
@harrysarson
Contributor Author

harrysarson commented Sep 25, 2020

I have circled back. I am very excited about this approach. Still very rough but I can parse type aliases that do not include records or tuples!

@Janiczek as part of this PR I have experimented with auto-generating tests. I have a directory of elm syntax snippets and a node/elm program that generates test cases with sources, the elm list of lexed items and the parsed AST. Then I use each test case for a couple of tests. Pretty cool! Currently it relies on Debug.toString producing valid elm code if I add the correct glob imports. Even if parser/lexer turns out to be a dead end, I think this testing method is valuable.
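
(To give a flavour, a hand-written sketch of the kind of test the generator might emit; Lexer.lex, the exposed names, and the expected list are assumptions rather than the script's real output:)

import Expect
import Lexer exposing (LexItem(..), LexSigil(..))
import Test exposing (Test, test)

generatedLexerTest : Test
generatedLexerTest =
    -- The expected list would be written out by the generator via Debug.toString.
    test "snippet: simple value declaration" <|
        \() ->
            Lexer.lex "x = 1"
                |> Expect.equal
                    [ Token "x"
                    , Whitespace 1
                    , Sigil Assign
                    , Whitespace 1
                    , NumericLiteral "1"
                    ]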

we also add a special case into our test case update script so that we
can write test cases and prefix them with an underscore to exclude them
from generation
adds some classes of item
I found this to be nice for value expressions. So using it for type expressions too.
Oops! We should leave "error recovery mode" when we see a non-indented
line. We were not doing so.
This was an extra layer we do not need. Expressions are never done until they have to be done due to a newline. Therefore, we only ever used the progress variant unless we are parsing the newline token.

Because we now parse newline tokens at a nice, high level of the parser we can deal with them nicely.
needs tidy up next commit
the logic is clearer when we have two separate functions
We had this when I wanted uniform return types. Much
better now though.
removes another case of passing a callback to a function. This time we use a new custom type. Much more elmish!
still a messy function, needs more love.
`exprAppend` becomes `appendPossiblyQualifiedTokenTo` and
`parentsToLeafWith` becomes `appendTypeExprTo`. These names
do a much better job at explaining what these functions do.