This repository has been archived by the owner on Oct 5, 2024. It is now read-only.

parser lexer #112 (Draft)

wants to merge 103 commits into master
Conversation

harrysarson
Contributor

Very light on details here; I will flesh out the description soon. In short, this is an attempt to replace our current parser with a two-stage lexer/parser (see this screencast, from which I have taken inspiration).

The first commit adds a lexer that can (maybe, probably not) parse any valid elm program. Moreover, it should never fail to parse anything, ever (see the Invalid LexItem into which I will stick anything that I cannot parse properly). I aim for the text => [ LexItems ] => text conversion to never be lossy.
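
(A sketch of how that losslessness could be checked with elm-test; Lexer.lex and Lexer.toString are assumed names here, not functions this PR defines.)

import Expect
import Lexer
import Test exposing (Test, test)

roundTrip : Test
roundTrip =
    test "text => [ LexItems ] => text is lossless" <|
        \() ->
            let
                source =
                    "module Foo exposing (..)\n\nx = 1"
            in
            Lexer.lex source
                |> Lexer.toString
                |> Expect.equal source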

@harrysarson
Contributor Author

The only really interesting thing here at this point in time (and the reason I am putting this PR out) is the LexItem data structure:

type LexItem
    = Sigil LexSigil
    | Token String
    | NumericLiteral String
    | TextLiteral LexLiteralType String
    | Whitespace Int
    | Newline Int
    | Comment LexCommentType String
    | Invalid String

type LexSigil
    = Bracket BracketType BracketTodoNameMe
    | Assign
    | Pipe
    | Comma
    | SingleDot
    | DoubleDot
    | ThinArrow
    | Backslash
    | Underscore
    | Colon
    | BinaryOperator LexBinaryOperator

type LexCommentType
    = LineComment
    | MultilineComment
    | DocComment

type LexBinaryOperator
    = Add
    | Subtract
    | Multiply
    | Divide
    | Exponentiate
    | And
    | Or
    | Equals
    | GreaterThan
    | GreaterThanEquals
    | LessThan
    | LessThanEquals
    | Append

type BracketType
    = Round
    | Square
    | Curly

type BracketTodoNameMe
    = Open
    | Close

type LexLiteralType
    = StringL StringType
    | CharL

type StringType
    = Single
    | Triple
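
(For a feel of the shape these types produce, here is a hand-written sketch, not actual lexer output, of how x = 1 + 2 might lex, with location info omitted:)

lexedExample : List LexItem
lexedExample =
    [ Token "x"
    , Whitespace 1
    , Sigil Assign
    , Whitespace 1
    , NumericLiteral "1"
    , Whitespace 1
    , Sigil (BinaryOperator Add)
    , Whitespace 1
    , NumericLiteral "2"
    ]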

Thoughts on these custom types would be very much appreciated!

@harrysarson
Contributor Author

In my experience with elm/parser, the most painful thing is understanding what the state of the parser is when it fails. The parser

  1. kindly spits out the location of the parser error
  2. gives some information about what it was trying to do when it failed
  3. but gives no information about its state when it fails.

If the lexer/parser approach is going to be superior to the elm/parser parser then it must do as well on (1) and (2) (which are the error information that users will see) but also help with (3) (this information is key to debugging or hacking around with the compiler).

I think the reason elm/parser parsers are poor at (3) is that they are a wrapper around an opaque function. Functions are by their nature impossible to compare and hard to visualise. Therefore, when I am designing the parser part of the lexer/parser I need to avoid the temptation to use function composition/lots of recursion. Instead I should try to craft some custom types which describe the state of the parser (what fraction of the AST it has produced so far) and compose these types. Then, if there is a problem I can dump this fraction of the AST, which should help with debugging.
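
(To illustrate what (3) could buy us: because the state is plain data, a failure report can simply dump it. A sketch, assuming the State and Error types outlined in the next comment; reportFailure is a hypothetical helper.)

reportFailure : State -> Error -> String
reportFailure state error =
    -- Both values are plain data, so Debug.toString can render them directly.
    "Parse failed with "
        ++ Debug.toString error
        ++ " while in state "
        ++ Debug.toString state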

The lexer has no idea whether an operator is binary or not! Let's
not pretend it does
@harrysarson
Contributor Author

I think this is how I want to structure the parser part of the lexer/parser. (An "entity" is a word I assigned today to anything in an elm file consisting of a line with no indentation followed by zero or more lines with indentation --- for example a custom type declaration, a type annotation, a module XX exposing (..) line, etc.)

type State
    = EntityStart
    | EntityFirstItem EntityFirstItem

type EntityFirstItem
    = EntityFirstItemType
    | EntityFirstItemModule
    | EntityFirstItemName String
    | ...

type Error
    = InvalidStartToEntity LexItem
    | MisplacedKeyword Keyword

parser : List LexItem -> Result Error Ast
parser items =
    parserHelp items EntityStart


parserHelp : List LexItem -> State -> Result Error Ast
parserHelp items state =
    case items of
        item :: rest ->
            let
                newState =
                    case state of
                        EntityStart -> parseEntityStart item
                        EntityFirstItem EntityFirstItemType -> parseTypeEntity item
                        EntityFirstItem EntityFirstItemModule -> parseModuleEntity item
                        EntityFirstItem (EntityFirstItemName _) -> parseValueDeclarationEntity item
            in
            case newState of
                Ok newState_ -> parserHelp rest newState_
                Err error -> Err error

        [] ->
            Ok (astFromState state)


parseEntityStart : LexItem -> Result Error State
parseEntityStart item =
    case item of
        Lexer.Token str ->
            case toKeyword str of
                Just TypeKeyword -> Ok (EntityFirstItem EntityFirstItemType)
                Just ModuleKeyword -> Ok (EntityFirstItem EntityFirstItemModule)

                Just other -> Err (MisplacedKeyword other)

                Nothing -> Ok (EntityFirstItem (EntityFirstItemName str))

        _ -> Err (InvalidStartToEntity item)
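
(toKeyword is left undefined above; a minimal sketch of the shape it might take, where ElseKeyword and the string table are purely illustrative:)

type Keyword
    = TypeKeyword
    | ModuleKeyword
    | ElseKeyword

toKeyword : String -> Maybe Keyword
toKeyword str =
    case str of
        "type" -> Just TypeKeyword
        "module" -> Just ModuleKeyword
        "else" -> Just ElseKeyword
        _ -> Nothing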

Comment on lines +65 to +67
, ( "works with nested module name"
, "module Foo.Bar exposing (..)"
)
Contributor

I would love these tests to not only check that there are no invalid items and that the round trip from/to String went OK, but also show what the lexed items are... for learning purposes.

Eg. if I saw

( "works with nested module name"
, "module Foo.Bar exposing (..)"
, [Token "module", ...] -- purposefully stripped of location info, to keep signal/noise ratio up
)

that would help me understand what the results of this lexing process look like.

Kinda like you did in the test just above this one :)

Contributor Author

Yeah, I went for the quickest way to make the Parser tests work for my Lexer (which was just to remove the bits that required any change in meaning).

To complete your example:

( "works with nested module name"
, "module Foo.Bar exposing (..)"
, [Token "module", Whitespace 1,  Token "Foo", Sigil FullStop, Token "Bar", Whitespace 1,  Token "exposing", Whitespace 1, Sigil (Bracket Round Open), Sigil DoubleDot, Sigil (Bracket Round Close) ]"
)

or something along these lines. I.e. very long! Unless I can find some way of automating it, I do not think I can face typing that out for all 100 or so tests in the LexerTests.elm file!

Contributor

Fair enough! The tests were previously written incrementally so the cost was ... amortized :) But right now it's not a huge priority, so I agree with you - let's not add those examples right now.

@Janiczek
Contributor

Janiczek commented Aug 6, 2020

Very interesting, thanks for prototyping that @harrysarson ! How do you feel about it so far? (Let's take this to the Discord if you want.)

I feel like we still have too little information to decide on whether to use this approach or not, since the other part that will take these LexItems and produce an AST still needs to be written (if I'm correct), but so far I see no problems/blockers that would make me want to scratch this idea 🙂 👍

@harrysarson
Contributor Author

I feel good! The progress I have made so far definitely counts more as exploration than implementation. It may be that we decide to scrap this entirely and I do not think that would be a bad outcome, because already I understand the problem scope a lot more than I did before. (For example, I am a big fan of these 3 aims for whatever solution we go for.)

No rush either, I definitely want to spend some time sitting on these ideas before committing myself to them.

@harrysarson
Contributor Author

I have run out of steam here for now. Closing in the hope that I one day soon cycle back to this and make more progress.

@Janiczek
Contributor

Thanks @harrysarson for the experiment! Yeah we can always circle back to it :)

We will have a lot of constructors and this namespacing will help tell
them apart.
run `elm repl` from ./tests/parser-tests to see live parsing!
and fix contextualisation so that the new cases do not cause crashes
we return the current state separately so this is just duplication
and fix crashes (and buggy parsing of bad syntax into a valid AST).
@harrysarson reopened this Sep 25, 2020
@harrysarson
Contributor Author

harrysarson commented Sep 25, 2020

I have circled back. I am very excited about this approach. Still very rough but I can parse type aliases that do not include records or tuples!

@Janiczek as part of this PR I have experimented with auto-generating tests. I have a directory of elm syntax snippets and a node/elm program that generates test cases with sources, the elm list of lexed items and the parsed AST. Then I use each test case for a couple of tests. Pretty cool! Currently it relies on Debug.toString producing valid elm code if I add the correct glob imports. Even if parser/lexer turns out to be a dead end, I think this testing method is valuable.
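
(To give a flavour, a hand-written sketch of the kind of test the generator might emit; Lexer.lex, the exposed names, and the expected list are assumptions rather than the script's real output:)

import Expect
import Lexer exposing (LexItem(..), LexSigil(..))
import Test exposing (Test, test)

generatedLexerTest : Test
generatedLexerTest =
    -- The expected list would be written out by the generator via Debug.toString.
    test "snippet: simple value declaration" <|
        \() ->
            Lexer.lex "x = 1"
                |> Expect.equal
                    [ Token "x"
                    , Whitespace 1
                    , Sigil Assign
                    , Whitespace 1
                    , NumericLiteral "1"
                    ]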

we also add a special case into our test case update script so that we
can write test cases and prefix them with an underscore to exclude them
from generation
adds some classes of item
I found this to be nice for value expressions. So using it for type expressions too.
Oops! We should leave "error recovery mode" when we see a non-indented
line. We were not doing so.
This was an extra layer we do not need. Expressions are never done until they have to be done due to a newline. Therefore, we only ever used the progress variant unless we are parsing the newline token.

Because we now parse newline tokens at a nice, high level of the parser we can deal with them nicely.
needs tidy up next commit
the logic is clearer when we have two separate functions
We had this when I wanted uniform return types. Much
better now though.
removes another case of passing a callback to a function. This time we use a new custom type. Much more elmish!
still a messy function, needs more love.
`exprAppend` becomes `appendPossiblyQualifiedTokenTo` and
`parentsToLeafWith` becomes `appendTypeExprTo`. These names
do a much better job at explaining what these functions do.