Summary
This RFC hopes to address the concerns in #197, #261, #271, and #329 by laying the foundation of pest's evolution and transition.
Motivation
While pest grammars offer an expressive language for building parsers, they lack certain features we've become accustomed to in programming languages, which weakens their effectiveness as expressive and reusable tools. With the growing popularity of the project, more and more discussion has focused on improving the predictability of pest as a language, and a number of needs have been put forth: trivia handling, reusability, expressiveness, and general consistency.
Trivia handling complexity
Probably the hardest concept to grasp when first learning the ropes is how trivia, i.e. whitespace and comments, is handled. pest has an automatic mechanism that permits trivia between expressions, and this mechanism is controlled by atomicity. Since atomicity cascades from a rule to the rules it calls, it's not immediately obvious whether two sequenced expressions a ~ b accept trivia: that depends entirely on whether or not the current rule inherits atomicity.
atomic = @{ definitely_not_atomic }
not_atomic = { confusing }
definitely_not_atomic = !{ confusing }
confusing = { a ~ b }
The example above illustrates how confusing can accept trivia in some cases but not others.
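The cascade can be modelled in a few lines of Rust. This is only a sketch of the pest 2 behaviour described above, not code from pest itself: the `Modifier` enum and `accepts_trivia` function are hypothetical names, the compound-atomic `${ }` modifier is ignored, and it assumes the grammar defines WHITESPACE or COMMENT so that there is trivia to accept at all.

```rust
/// How a rule is marked in a pest 2 grammar (simplified, hypothetical model).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Modifier {
    /// `{ ... }`: inherits atomicity from the calling rule.
    Normal,
    /// `@{ ... }`: atomic, and the atomicity cascades to called rules.
    Atomic,
    /// `!{ ... }`: non-atomic, stops an atomic cascade.
    NonAtomic,
}

/// Returns true if a sequence `a ~ b` in the innermost rule accepts trivia,
/// given the modifiers of the rules on the call path, outermost first.
fn accepts_trivia(call_path: &[Modifier]) -> bool {
    // Parsing starts outside of any atomic rule.
    let mut atomic = false;
    for modifier in call_path {
        match modifier {
            Modifier::Atomic => atomic = true,
            Modifier::NonAtomic => atomic = false,
            // A normal rule keeps whatever state it was called in.
            Modifier::Normal => {}
        }
    }
    !atomic
}
```

With this model, the path not_atomic -> confusing is `[Normal, Normal]` and accepts trivia, as does atomic -> definitely_not_atomic -> confusing, which is `[Atomic, NonAtomic, Normal]`. A hypothetical rule `@{ confusing }` would give `[Atomic, Normal]`, where the same `a ~ b` rejects trivia.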
Reusability of expressions and rules
While rules can be composed from one another, there is currently no means to parametrize them. Parametrization can be extremely useful in cases where some idioms are often reused, e.g. repeated, separated values. Currently, you need to repeat some form of e ~ ("," ~ e)* which is less readable than separated(e, ",").
Though less immediately useful, another addition would be to be able to use rules from different grammars.
Expressiveness
Improving expressiveness is somewhat of a continuously open question. In 2.0 we've added additional stack calls that help recognize indentation-sensitive languages, namely PEEK_ALL, POP_ALL, and DROP. This conservative design was adopted in order to better understand what exactly is needed in real-world examples.
However, a legitimate need for more refined localization within the stack has been illustrated in #329. Being able to accurately slice the stack for each of the PEEK, POP, and DROP calls seems to be required going forward.
General consistency
With the introduction of built-in rules, capitalization was chosen to differentiate them from user-defined rules. Stack calls, start- and end-of-input calls, and unicode categories are capitalized as well. The only way of telling these groups apart is to know ahead of time what each one does.
Guide-level explanation
Versioning
The pest language will be versioned according to the semver guide, and the grammar language version can optionally be selected before parsing. This will ensure a smoother transition to 3.0 and beyond by enabling users to opt in to the newer version early on.
Modules
Akin to Rust's modules, a module can contain rules or other modules. This removes the need for capitalization of built-in rules. They can be part of separate modules.
/// Modules can be created by importing other grammars and are immediately public.
use "cool.pest";
use "this.pest" as that;
/// pest has its own sub-modules.
any = { pest::any }
stack = { pest::stack::peek }
unicode = { pest::unicode::binary::punctuation }
Parametrizable rules
Rules will have optional arguments. Their definition will be parametrizable with argument names, all of them being valid pest expressions.
/// Definition
separated(e, s) = _{ e ~ (s ~ e)* }
/// Use
comma_separated(e) = _{ separated(e, ",") }
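To make the intent of parametrized rules concrete, the two definitions above can be read as higher-order functions. The following Rust sketch is an analogy only, not how pest would generate code: the `Parser` type and the `literal` helper are hypothetical, and a parser is reduced to a function from an input and a position to the position after a successful match.

```rust
/// A parser maps (input, position) to the position after a match, or None.
type Parser<'a> = Box<dyn Fn(&str, usize) -> Option<usize> + 'a>;

/// Matches the exact string `s` (hypothetical helper standing in for a pest literal).
fn literal<'a>(s: &'a str) -> Parser<'a> {
    Box::new(move |input: &str, pos: usize| {
        if input.get(pos..)?.starts_with(s) {
            Some(pos + s.len())
        } else {
            None
        }
    })
}

/// separated(e, s) = _{ e ~ (s ~ e)* }
fn separated<'a>(e: Parser<'a>, s: Parser<'a>) -> Parser<'a> {
    Box::new(move |input: &str, pos: usize| {
        let mut end = e(input, pos)?;
        loop {
            // (s ~ e)*: a separator only counts when an element follows it.
            match s(input, end).and_then(|p| e(input, p)) {
                // Require progress so empty matches cannot loop forever.
                Some(next) if next > end => end = next,
                _ => return Some(end),
            }
        }
    })
}

/// comma_separated(e) = _{ separated(e, ",") }
fn comma_separated<'a>(e: Parser<'a>) -> Parser<'a> {
    separated(e, literal(","))
}
```

For instance, `comma_separated(literal("1"))` applied to `"1,1,"` at position 0 stops at position 3: the trailing comma is left unconsumed because no element follows it.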
Controlled trivia
The infix sequence operator ~ itself will be a user-defined rule:
~(lhs, rhs) = { lhs ~ " "* ~ rhs }
Without any ~ defined, ~, *, +, and {} operators will all run according to their definitions without accepting any trivia between expressions. When it is defined, the repetitions will make use of the sequence operator:
*(e) = { e? ~ e* }
+(e) = { e ~ e* }
/// ... etc.
In order to be able to have both trivia-accepting and non-trivia-accepting operators working together, separate non-trivia operators will be introduced, namely - for sequence and all repetitions preceded by it:
| Operator | Trivia | Non-trivia |
|---|---|---|
| Sequence | ~ | - |
| Repeat zero or more times | * | -* |
| Repeat one or more times | + | -+ |
| Repeat exactly n times | {n} | -{n} |
| Repeat minimum of n times | {n..} | -{n..} |
| Repeat maximum of n - 1 times | {..n} | -{..n} |
| Repeat maximum of n times | {..=n} | -{..=n} |
| Repeat between m and n - 1 times | {m..n} | -{m..n} |
| Repeat between m and n times | {m..=n} | -{m..=n} |
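The difference between the two operator families can be sketched in Rust. This is an illustration under assumptions rather than the proposed implementation: trivia is taken to be zero or more spaces, as in the example `~` definition above, operands are restricted to string literals, and the helper names `literal`, `skip_trivia`, `seq`, and `repeat` are hypothetical.

```rust
/// Consumes the literal `lit` at `pos`, returning the position after it.
fn literal(input: &str, pos: usize, lit: &str) -> Option<usize> {
    if input.get(pos..)?.starts_with(lit) {
        Some(pos + lit.len())
    } else {
        None
    }
}

/// Skips trivia, assumed here to be zero or more spaces.
fn skip_trivia(input: &str, pos: usize) -> usize {
    pos + input[pos..].bytes().take_while(|b| *b == b' ').count()
}

/// `lhs ~ rhs` when `trivia` is true, `lhs - rhs` when it is false.
fn seq(input: &str, pos: usize, lhs: &str, rhs: &str, trivia: bool) -> Option<usize> {
    let after_lhs = literal(input, pos, lhs)?;
    let rhs_start = if trivia { skip_trivia(input, after_lhs) } else { after_lhs };
    literal(input, rhs_start, rhs)
}

/// `e*` when `trivia` is true, `e-*` when it is false.
/// Returns the position after the last repetition (or `pos` if none matched).
fn repeat(input: &str, pos: usize, e: &str, trivia: bool) -> usize {
    let mut end = match literal(input, pos, e) {
        Some(p) => p,
        None => return pos,
    };
    loop {
        // Trivia is only allowed between repetitions, never after the last one.
        let start = if trivia { skip_trivia(input, end) } else { end };
        match literal(input, start, e) {
            Some(p) if p > end => end = p,
            _ => return end,
        }
    }
}
```

On the input `"a   b"`, `seq` with `trivia` set matches all five bytes, while the non-trivia form fails because `b` does not directly follow `a`.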
Stack slicing
Stack slicing will work similarly to Rust slicing with the exception that ranges will accept negative end values, similarly to Python. Slicing will happen from bottom to top such that for a stack [a, b, c, d, e]:
[0] == a
[-1] == e
[1..4] == [b, c, d]
[1..-1] == [b, c, d]
[1..=-1] == [b, c, d, e]
[..-2] == [a, b, c]
As such, pest::stack::*, i.e. peek, pop, drop, can be optionally sliced or indexed, e.g. pest::stack::peek[..-1]. The indices will be constant with the exception of those relative to the top of the stack since the stack's size is variable.
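The slicing rules can be checked with a small Rust sketch. It is not taken from pest: `resolve` and `slice` are hypothetical helpers, the stack is a plain slice stored bottom first, and allowing a negative start (not only a negative end) is an assumption made so that indexing with `-1` can be expressed through the same function.

```rust
/// Resolves a possibly negative index against a stack of `len` elements.
/// Negative values count from the top, so -1 is the topmost element.
fn resolve(index: isize, len: usize) -> Option<usize> {
    let resolved = if index < 0 { len as isize + index } else { index };
    if (0..=len as isize).contains(&resolved) {
        Some(resolved as usize)
    } else {
        None
    }
}

/// Slices `stack` with `start..end`, or `start..=end` when `inclusive` is true.
/// A missing bound defaults to the bottom or the top of the stack.
fn slice<T>(
    stack: &[T],
    start: Option<isize>,
    end: Option<isize>,
    inclusive: bool,
) -> Option<&[T]> {
    let len = stack.len();
    let lo = resolve(start.unwrap_or(0), len)?;
    let hi = match end {
        None => len,
        Some(e) => {
            let e = resolve(e, len)?;
            if inclusive { e + 1 } else { e }
        }
    };
    // Out-of-range or inverted bounds yield None instead of panicking.
    stack.get(lo..hi)
}
```

Because negative bounds are resolved against the current stack length at the time of the call, the same `[..-1]` slice selects a different number of elements as the stack grows or shrinks.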
Reference-level explanation
The grammar's version will be selected through the grammar attribute:
#[grammar = "grammar.pest", version = "3.0"]
pest_meta will handle both grammar language versions during the 2.* transition period, then migrate to 3.0. This will need to be enforced if we want to take advantage of the more concise grammars during optimization and generation.
Much of the rest of this RFC is straightforward:
- add second grammar
- implement validation
- add module resolution to AST (in pest_meta and pest_generator)
- add rule parameters to AST (in pest_meta and pest_generator)
Drawbacks
Breaking compatibility so early could be dangerous, but we can offer help for people migrating to 3.0. If need be, we could also offer a pest fix tool that would be able to convert 2.0 to 3.0 grammars.
Some of the syntax introduced for trivia handling might be a little heavy on the eye, and we might want to fine-tune it before it's set in stone.