parser lexer #112
base: master
Conversation
The only really interesting thing here at this point in time (and the reason I am putting this PR out) is `compiler/src/Stage/Parse/Lexer.elm`, lines 8 to 73 (at a216421).
Thoughts on these custom types would be very much appreciated!
In my experience of elm/parser, the most painful thing is understanding what the state of the parser is when it fails. The parser …
If the lexer/parser approach is going to be superior to the elm/parser one, then it must do as well on (1) and (2) (the error information that users will see) but also help with (3) (this information is key to debugging or hacking around with the compiler). I think the reason elm/parser parsers are poor at (3) is that they are a wrapper around an opaque function. Functions are by their nature impossible to compare and hard to visualise. Therefore, when I am designing the parser part of the lexer/parser, I need to avoid the temptation to use function composition and lots of recursion. Instead I should try to craft some custom types which describe the state of the parser (what fraction of the AST it has produced so far) and compose those types. Then, if there is a problem, I can dump this fraction of the AST, which should help with debugging.
The lexer has no idea whether an operator is binary or not! Let's not pretend it does
I think this is how I want to structure the parser part of the lexer/parser. (An "entity" is a word I assigned today to anything in an elm file consisting of a line with no indentation followed by zero or more lines with indentation --- for example a custom type declaration, a type annotation, …)

```elm
type State
    = EntityStart
    | EntityFirstItem EntityFirstItem


type EntityFirstItem
    = EntityFirstItemType
    | EntityFirstItemModule
    | EntityFirstItemName String
      -- ...


type Error
    = InvalidStartToEntity LexItem
    | MisplacedKeyword Keyword


parser : List LexItem -> Result Error Ast
parser items =
    parserHelp items EntityStart


parserHelp : List LexItem -> State -> Result Error Ast
parserHelp items state =
    case items of
        item :: rest ->
            let
                newState =
                    case state of
                        EntityStart ->
                            parseEntityStart item

                        EntityFirstItem EntityFirstItemType ->
                            parseTypeEntity item

                        EntityFirstItem EntityFirstItemModule ->
                            parseModuleEntity item

                        EntityFirstItem (EntityFirstItemName _) ->
                            parseValueDeclarationEntity item
            in
            case newState of
                Ok newState_ ->
                    parserHelp rest newState_

                Err error ->
                    Err error

        [] ->
            Ok (astFromState state)


parseEntityStart : LexItem -> Result Error State
parseEntityStart item =
    case item of
        Lexer.Token str ->
            case toKeyword str of
                Just TypeKeyword ->
                    Ok (EntityFirstItem EntityFirstItemType)

                Just ModuleKeyword ->
                    Ok (EntityFirstItem EntityFirstItemModule)

                Just other ->
                    Err (MisplacedKeyword other)

                Nothing ->
                    Ok (EntityFirstItem (EntityFirstItemName str))

        _ ->
            Err (InvalidStartToEntity item)
```
```elm
, ( "works with nested module name"
  , "module Foo.Bar exposing (..)"
  )
```
I would love these tests to not only check that there are no invalid items and that the roundtrip from/to String went OK, but also to show what the lexed items are... for learning purposes.
E.g. if I saw

```elm
( "works with nested module name"
, "module Foo.Bar exposing (..)"
, [ Token "module", ... ] -- purposefully stripped of location info, to keep signal/noise ratio up
)
```

that would help me understand what the results of this lexing process look like.
Kinda like you did in the test just above this one :)
Yeah, I went for the quickest way to make the Parser tests work for my Lexer (which was just to remove the bits that required any meaning change).
To complete your example:

```elm
( "works with nested module name"
, "module Foo.Bar exposing (..)"
, [ Token "module", Whitespace 1, Token "Foo", Sigil FullStop, Token "Bar", Whitespace 1, Token "exposing", Whitespace 1, Sigil (Bracket Round Open), Sigil DoubleDot, Sigil (Bracket Round Close) ]
)
```

or something along these lines. I.e. very long! Unless I can find some way of automating it, I do not think I can face typing that out for all 100 or so tests in the LexerTests.elm file!
Fair enough! The tests were previously written incrementally so the cost was ... amortized :) But right now it's not a huge priority, so I agree with you - let's not add those examples right now.
Very interesting, thanks for prototyping that @harrysarson ! How do you feel about it so far? (Let's take this to the Discord if you want.) I feel like we still have too little information to decide on whether to use this approach or not, since the other part that will take these LexItems and produce the AST still needs to be written (if I'm correct), but so far I see no problems/blockers that would make me want to scratch this idea 🙂 👍
I feel good! The progress I have made so far definitely counts more as exploration than implementation. It may be that we decide to scrap this entirely, and I do not think that would be a bad outcome, because I already understand the problem scope a lot more than I did before. (For example, I am a big fan of these 3 aims for whatever solution we go for.) No rush either; I definitely want to spend some time sitting on these ideas before committing myself to them.
I have run out of steam here for now. Closing in the hope that I one day soon cycle back to this and make more progress.
Thanks @harrysarson for the experiment! Yeah we can always circle back to it :)
We will have a lot of constructors and this namespacing will help tell them apart.
run `elm repl` from ./tests/parser-tests to see live parsing!
and fix contextualisation so that the new cases do not cause crashes
we return the current state separately, so this is just duplication
and fix crashes (and buggy parsing of bad syntax into a valid AST).
I have circled back. I am very excited about this approach. Still very rough, but I can parse type aliases that do not include records or tuples! @Janiczek as part of this PR I have experimented with auto-generating tests. I have a directory of elm syntax snippets and a node/elm program that generates test cases with the source, the elm list of lexed items, and the parsed AST. Then I use each test case for a couple of tests. Pretty cool! Currently it relies on …
we also add a special case to our test case update script so that we can write test cases prefixed with an underscore to exclude them from generation
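The underscore convention described above could be implemented with a filter along these lines (a hypothetical sketch; the actual update script is a node/elm program and these names are assumptions, not the PR's real code):

```elm
-- Hypothetical sketch: skip snippet files whose names start with "_",
-- so hand-written test cases are excluded from test generation.
shouldGenerate : String -> Bool
shouldGenerate fileName =
    not (String.startsWith "_" fileName)


-- e.g. List.filter shouldGenerate [ "alias.elm", "_wip.elm" ] == [ "alias.elm" ]
```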
adds some classes of item
I found this to be nice for value expressions. So using it for type expressions too.
Oops! We should leave "error recovery mode" when we see a non-indented line. We were not doing so.
This was an extra layer we do not need. Expressions are never done until they have to be done due to a newline. Therefore, we only ever used the progress variant except when parsing the newline token. Because we now parse newline tokens at a nice and high level of the parser, we can deal with them cleanly.
needs tidy up next commit
the logic is clearer when we have two separate functions
We had this when I wanted uniform return types. Much better now though.
removes another case of passing a callback to a function. This time we use a new custom type. Much more elmish!
still a messy function, needs more love.
`exprAppend` becomes `appendPossiblyQualifiedTokenTo` and `parentsToLeafWith` becomes `appendTypeExprTo`. These names do a much better job at explaining what these functions do.
Very light on details here; I will flesh out the chat soon. In short, this will be an attempt to replace our current parser with a two-stage lexer/parser (see this screencast, from which I have taken inspiration).
The first commit adds a lexer that can (maybe, probably not) parse any valid elm program. Moreover, it should never fail to lex anything, ever (see the `Invalid` `LexItem`, into which I will stick anything that I cannot parse properly). I aim for the `text => [ LexItem ] => text` conversion to never be lossy.
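The lossless round-trip property could be sketched like this. The `Token`, `Whitespace` and `Invalid` names come from this thread, but the exact shape of the type is an assumption, not the PR's actual `LexItem` definition:

```elm
-- A minimal sketch of how an Invalid variant keeps lexing total:
-- anything the lexer cannot recognise is captured verbatim, so
-- rebuilding the source text from the items never loses information.
type LexItem
    = Token String
    | Whitespace Int
    | Invalid String -- unlexable input, stored as-is instead of failing


lexItemToString : LexItem -> String
lexItemToString item =
    case item of
        Token str ->
            str

        Whitespace n ->
            String.repeat n " "

        Invalid str ->
            str


-- The round-trip aim: for any source,
-- String.concat (List.map lexItemToString (lex source)) == source
```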