pest language evolution #333

@dragostis

Summary

This RFC hopes to address the concerns in #197, #261, #271, and #329 by laying the foundation of pest's evolution and transition.

Motivation

While pest grammars offer an expressive language for building grammars, they lack certain features we've become accustomed to in programming languages, which weakens their effectiveness as expressive and reusable tools. With the growing popularity of the project, more and more discussion has focused on improving the predictability of pest as a language, and a number of needs have been put forth: trivia handling, reusability, expressiveness, and general consistency.

Trivia handling complexity

Probably the hardest concept to grasp when first learning the ropes is how trivia, i.e. whitespace and comments, is handled. pest has an automatic mechanism, controlled by atomicity, that simply permits trivia to live between expressions. Since atomic rules are cascading, it's not immediately obvious whether two sequenced expressions a ~ b accept trivia: it wholly depends on whether or not the current rule inherits atomicity.

atomic                = @{ confusing }
not_atomic            =  { confusing }
definitely_not_atomic = !{ confusing }

confusing = { a ~ b }

The example above illustrates how confusing can accept trivia in some cases but not others.
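
The cascade can be modelled in a few lines of Rust. This is only a sketch of the rule of thumb, not pest's implementation: the `Modifier` enum and `accepts_trivia` function are hypothetical names, compound-atomic rules are left out, and it assumes trivia rules are defined in the grammar at all.

```rust
/// How a rule is declared (hypothetical model, covering only the three
/// forms used in this section).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Modifier {
    /// `{ ... }`: inherits atomicity from the calling rule.
    Normal,
    /// `@{ ... }`: atomic, and the atomicity cascades to called rules.
    Atomic,
    /// `!{ ... }`: non-atomic, stops an inherited cascade.
    NonAtomic,
}

/// Returns true when the last rule in `call_chain` accepts trivia between
/// sequenced expressions. The chain lists the modifier of every rule from
/// the entry rule down to the rule of interest.
fn accepts_trivia(call_chain: &[Modifier]) -> bool {
    let mut atomic = false; // parsing starts out non-atomic
    for modifier in call_chain {
        atomic = match modifier {
            Modifier::Normal => atomic,
            Modifier::Atomic => true,
            Modifier::NonAtomic => false,
        };
    }
    !atomic
}
```

A plain rule such as confusing has no fixed answer of its own: the result depends entirely on the chain of rules that led to it.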

Reusability of expressions and rules

While rules can be composed from one another, there is currently no means to parametrize them. Parametrization can be extremely useful in cases where some idioms are often reused, e.g. repeated, separated values. Currently, you need to repeat some form of e ~ ("," ~ e)* which is less readable than separated(e, ",").

Though less immediately useful, another addition would be to be able to use rules from different grammars.

Expressiveness

Improving expressiveness is somewhat of a continuously open question. In 2.0 we've added additional stack calls that help recognize indentation-sensitive languages, namely PEEK_ALL, POP_ALL, and DROP. This conservative design was adopted in order to better understand what exactly is needed in real-world examples.

However, a legitimate need for more refined addressing within the stack has been illustrated in #329. Being able to accurately slice the stack for every one of the PEEK, POP, and DROP calls seems to be required going forward.

General consistency

With the introduction of built-in rules, capitalization has been selected as a form of differentiation from user-defined rules. Stack calls, start- and end-of-input calls, and Unicode categories are capitalized as well. The only way of telling these groups apart is to simply know ahead of time what each one does.

Guide-level explanation

Versioning

The pest language will be versioned according to the semver guide, and the grammar language version will optionally be selected before parsing. This will ensure a smoother transition to 3.0 and beyond by enabling users to opt in to the newer version early on.

Modules

Akin to Rust's modules, a module can contain rules or other modules. This removes the need for capitalization of built-in rules, since they can live in separate modules.

/// Modules can be created by importing other grammars and are immediately public.
use "cool.pest";
use "this.pest" as that;

/// pest has its own sub-modules.
any     = { pest::any }
stack   = { pest::stack::peek }
unicode = { pest::unicode::binary::punctuation }
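
To make the lookup concrete, here is a rough Rust sketch of resolving a `::`-separated path against a tree of modules. Everything in it is an assumption made for illustration: the `Module` type, the `resolve` method, and the placeholder rule bodies are hypothetical and do not reflect how pest_meta stores grammars.

```rust
use std::collections::HashMap;

/// A module holds rules (name -> placeholder body) and nested modules.
#[derive(Default)]
struct Module {
    rules: HashMap<String, String>,
    modules: HashMap<String, Module>,
}

impl Module {
    /// Resolve a path such as `pest::stack::peek`: every segment but the
    /// last names a module, the last segment names a rule.
    fn resolve(&self, path: &str) -> Option<&str> {
        let mut segments: Vec<&str> = path.split("::").collect();
        let rule_name = segments.pop()?;
        let mut current = self;
        for segment in segments {
            current = current.modules.get(segment)?;
        }
        current.rules.get(rule_name).map(String::as_str)
    }
}

/// Build a tiny tree containing `pest::any` and `pest::stack::peek`.
fn example_root() -> Module {
    let mut stack = Module::default();
    stack.rules.insert("peek".to_string(), "<built-in peek>".to_string());

    let mut pest = Module::default();
    pest.rules.insert("any".to_string(), "<built-in any>".to_string());
    pest.modules.insert("stack".to_string(), stack);

    let mut root = Module::default();
    root.modules.insert("pest".to_string(), pest);
    root
}
```

With this shape, built-in rules need no special spelling: they are ordinary rules that happen to live under a `pest` module.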

Parametrizable rules

Rules will accept optional arguments. A rule definition can declare named parameters, and each argument passed at the use site can be any valid pest expression.

/// Definition
separated(e, s) = _{ e ~ (s ~ e)* }

/// Use
comma_separated(e) = _{ separated(e, ",") }
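
For readers more at home with parser combinators, the same idea can be sketched in Rust as functions that take parsers and return a parser. This is an analogy rather than generated pest code: the `Parser` alias and the `literal` helper are hypothetical, and no trivia is accepted between expressions.

```rust
/// A parser receives the input and a byte offset, and returns the offset
/// after a successful match, or `None` when it does not match.
type Parser = Box<dyn Fn(&str, usize) -> Option<usize>>;

/// Match a literal string at the current offset.
fn literal(text: &'static str) -> Parser {
    Box::new(move |input: &str, pos: usize| -> Option<usize> {
        let rest = input.get(pos..)?;
        if rest.starts_with(text) {
            Some(pos + text.len())
        } else {
            None
        }
    })
}

/// Counterpart of `separated(e, s) = _{ e ~ (s ~ e)* }`.
fn separated(e: Parser, s: Parser) -> Parser {
    Box::new(move |input: &str, pos: usize| -> Option<usize> {
        let mut end = e(input, pos)?;
        // (s ~ e)*: advance only when a whole `s ~ e` pair matches.
        while let Some(next) = s(input, end).and_then(|p| e(input, p)) {
            if next == end {
                break; // both parsers matched nothing; avoid looping forever
            }
            end = next;
        }
        Some(end)
    })
}

/// Counterpart of `comma_separated(e) = _{ separated(e, ",") }`.
fn comma_separated(e: Parser) -> Parser {
    separated(e, literal(","))
}
```

For instance, `comma_separated(literal("ab"))` consumes all of `"ab,ab,ab"`, while on `"ab,"` it stops after the first `ab` because the trailing comma is not followed by another element.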

Controlled trivia

The infix sequence operator ~ itself will be a user-defined rule:

~(lhs, rhs) = { lhs ~ " "* ~ rhs }

Without any ~ defined, ~, *, +, and {} operators will all run according to their definitions without accepting any trivia between expressions. When it is defined, the repetitions will make use of the sequence operator:

*(e) = { e? ~ e* }
+(e) = { e ~ e* }
/// ... etc.

In order to have both trivia-accepting and non-trivia-accepting operators working together, separate non-trivia operators will be introduced, namely - for sequence, and every repetition operator prefixed with -:

Operator                            Trivia     Non-trivia
Sequence                            ~          -
Repeat zero or more times           *          -*
Repeat one or more times            +          -+
Repeat exactly n times              {n}        -{n}
Repeat minimum of n times           {n..}      -{n..}
Repeat maximum of n - 1 times       {..n}      -{..n}
Repeat maximum of n times           {..=n}     -{..=n}
Repeat between m and n - 1 times    {m..n}     -{m..n}
Repeat between m and n times        {m..=n}    -{m..=n}
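
The difference between the two sequence operators can be illustrated with a small Rust sketch. It assumes `~` has been defined as `~(lhs, rhs) = { lhs ~ " "* ~ rhs }`, and that the uses of `~` inside that definition are plain concatenation. The `Parser` alias, `literal`, `seq_strict`, and `seq_trivia` are hypothetical names, not part of pest.

```rust
/// A parser receives the input and a byte offset, and returns the offset
/// after a successful match, or `None` when it does not match.
type Parser = Box<dyn Fn(&str, usize) -> Option<usize>>;

/// Match a literal string at the current offset.
fn literal(text: &'static str) -> Parser {
    Box::new(move |input: &str, pos: usize| -> Option<usize> {
        let rest = input.get(pos..)?;
        if rest.starts_with(text) {
            Some(pos + text.len())
        } else {
            None
        }
    })
}

/// `lhs - rhs`: `rhs` must start exactly where `lhs` ended.
fn seq_strict(lhs: Parser, rhs: Parser) -> Parser {
    Box::new(move |input: &str, pos: usize| -> Option<usize> {
        let mid = lhs(input, pos)?;
        rhs(input, mid)
    })
}

/// `lhs ~ rhs` under the user definition above: any number of spaces may
/// sit between the two operands.
fn seq_trivia(lhs: Parser, rhs: Parser) -> Parser {
    Box::new(move |input: &str, pos: usize| -> Option<usize> {
        let mut mid = lhs(input, pos)?;
        while input.get(mid..)?.starts_with(' ') {
            mid += 1; // a space is one byte, so this stays on a char boundary
        }
        rhs(input, mid)
    })
}
```

Both variants accept `ab`, but only the trivia-accepting one accepts `a  b`.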

Stack slicing

Stack slicing will work similarly to Rust slicing with the exception that ranges will accept negative end values, similarly to Python. Slicing will happen from bottom to top such that for a stack [a, b, c, d, e]:

  • [0] == a
  • [-1] == e
  • [1..4] == [b, c, d]
  • [1..-1] == [b, c, d]
  • [1..=-1] == [b, c, d, e]
  • [..-2] == [a, b, c]

As such, pest::stack::*, i.e. peek, pop, drop, can be optionally sliced or indexed, e.g. pest::stack::peek[..-1]. The indices will be constant with the exception of those relative to the top of the stack since the stack's size is variable.
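
One way to read these rules is as a normalization step that turns every index into an offset from the bottom of the stack before slicing. The Rust sketch below assumes zero-based indexing from the bottom, which is what the range examples imply. The function names `resolve`, `index`, and `slice` are hypothetical, and treating out-of-range bounds as `None` is a choice made here, not something the RFC specifies.

```rust
/// Turn a possibly negative index into an offset from the bottom of the
/// stack. Negative values count back from the top, as in Python.
fn resolve(index: i64, len: usize) -> Option<usize> {
    let resolved = if index < 0 { len as i64 + index } else { index };
    if (0..=len as i64).contains(&resolved) {
        Some(resolved as usize)
    } else {
        None
    }
}

/// Index a single element, e.g. `[0]` or `[-1]`.
fn index<T>(stack: &[T], i: i64) -> Option<&T> {
    stack.get(resolve(i, stack.len())?)
}

/// Slice `[start..end]`, or `[start..=end]` when `inclusive` is true.
/// A missing bound is passed as `None`.
fn slice<T>(
    stack: &[T],
    start: Option<i64>,
    end: Option<i64>,
    inclusive: bool,
) -> Option<&[T]> {
    let len = stack.len();
    let lo = match start {
        Some(s) => resolve(s, len)?,
        None => 0,
    };
    let hi = match end {
        Some(e) => resolve(e, len)? + usize::from(inclusive),
        None => len,
    };
    stack.get(lo..hi)
}
```

Because negative bounds are resolved against the current length, the same slice expression can select different elements as the stack grows or shrinks during parsing.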

Reference-level explanation

The grammar's version will be selected through the grammar attribute:

#[grammar = "grammar.pest", version = "3.0"]

pest_meta will handle both grammar language versions during the 2.* transition period, then migrate to 3.0. This will need to be enforced if we want to take advantage of the more concise grammars during optimization and generation.

Much of the rest of this RFC is straightforward:

  1. add second grammar
  2. implement validation
  3. add module resolution to AST (in pest_meta and pest_generator)
  4. add rule parameters to AST (in pest_meta and pest_generator)

Drawbacks

Breaking compatibility so early could be dangerous, but we can offer help for people migrating to 3.0. If need be, we could also offer a pest fix tool that would be able to convert 2.0 to 3.0 grammars.

Some of the syntax introduced in the trivia handling might be a little heavy on the eye, and we might want to fine-tune it before it's set in stone.
