Allow use of Unicode property escapes to match a character #375

Open
mojavelinux opened this issue Mar 7, 2023 · 11 comments
Labels
enhancement New feature or request

Comments

@mojavelinux

mojavelinux commented Mar 7, 2023

In a character expression (e.g., [a-z]), I would like to be able to use a Unicode property escape (i.e., Unicode Character Category) to express the group of characters to match. The reason for this request is to parse input that contains reserved syntax that's not limited to the ASCII character set.

For example, I could define a rule to match any alpha character as defined by Unicode using the following parsing expression:

alpha = [\p{Alpha}]

I would only expect the property escape to be passed through to the underlying regular expression. Peggy would just need to allow the \p{...} and \P{...} sequences inside the square brackets of a character expression in the grammar file. Additionally, the "u" flag must be added to the generated regular expression.
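To illustrate why the "u" flag matters, here is a minimal sketch in plain JavaScript, independent of Peggy. The names `legacy`, `unicode`, and `isAlpha` are made up for this example. Without the flag, `\p` is read as an escaped literal "p", so the class matches the individual characters p, {, A, l, h, a, and } rather than the Unicode property.

```javascript
// Without "u": \p is just an escaped "p", so this class contains the
// literal characters p, {, A, l, h, a and }.
const legacy = /^[\p{Alpha}]$/

// With "u": \p{Alpha} is the Unicode Alphabetic property.
const unicode = /^[\p{Alpha}]$/u

// Hypothetical helper used only for this illustration.
function isAlpha (ch) {
  return unicode.test(ch)
}

console.log(isAlpha('\u00e9')) // true: e with acute accent is alphabetic
console.log(isAlpha('{')) // false
console.log(legacy.test('{')) // true: "{" is a literal member of the class
console.log(legacy.test('\u00e9')) // false
```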

In fact, even Peggy's own grammar has this need: https://github.com/peggyjs/peggy/blob/main/src/parser.pegjs#L476C2-L530. While I'm not suggesting that Peggy itself use these escape sequences, users of Peggy would benefit from being able to use them. That is certainly more reasonable than having to maintain all those categories by hand.

@mojavelinux
Author

I have been able to patch in support for Unicode property escapes using a Peggy plugin. Here's the quick and dirty code to do that:

'use strict'

function rewriteRegExps (node) {
  const children = node.children
  for (const [idx, child] of children.entries()) {
    if (typeof child === 'string') {
      if (child.includes('var peg$r') && child.includes('p{')) {
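        // Look for generated "var peg$rN = /.../;" lines whose pattern contains "p{...}"
        children[idx] = child.replace(/^( *var peg\$r\d+ .*? )(\/.*p\{.+?\}.*\/)(;.*)/gm, (match, before, rx, after) => {
          // Put the backslash back on each unescaped "p{...}" and append the "u" flag
          return before + rx.replace(/(?<!\\)p\{.+?\}/g, '\\$&') + 'u' + after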
        })
        break
      }
    } else {
      rewriteRegExps(child)
    }
  }
}

module.exports = {
  use (config, options) {
    config.passes.generate.push((ast) => {
      rewriteRegExps(ast.code)
    })
  }
}
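As a rough illustration of what the rewrite step does, the sketch below applies the same two replacements to a single line of generated source. The helper name `patchLine` is hypothetical, and the sample line is an assumption about what Peggy emits for `alpha = [\p{Alpha}]` (the backslash is dropped, leaving `p{Alpha}` in the class). The inner pattern uses a lookbehind, `(?<!\\)`, so sequences that already carry a backslash are left alone.

```javascript
// Hypothetical helper: patch one line of generated parser source.
function patchLine (line) {
  return line.replace(/^( *var peg\$r\d+ .*? )(\/.*p\{.+?\}.*\/)(;.*)/gm, (match, before, rx, after) => {
    // Restore the backslash on unescaped "p{...}" and add the "u" flag.
    return before + rx.replace(/(?<!\\)p\{.+?\}/g, '\\$&') + 'u' + after
  })
}

// Assumed shape of the generated line; real output may differ.
const generated = '  var peg$r0 = /^[p{Alpha}]/;'

console.log(patchLine(generated))
// prints:   var peg$r0 = /^[\p{Alpha}]/u;

// Lines that do not declare a peg$r constant are returned unchanged.
console.log(patchLine('  var unrelated = 1;'))
```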

@hildjj
Contributor

hildjj commented Mar 7, 2023

As a quick workaround, you can use:

alpha = char:. &{ return char.match(/^\p{Alpha}$/u) }

@mojavelinux
Author

I actually prefer the workaround using a plugin, which is quite a nice feature to tap into for cases like this. Since these character classes show up all over the grammar, using semantic predicates simply makes the grammar too difficult to read.

@hildjj
Contributor

hildjj commented Mar 8, 2023

See #378. If we can generate good modern code for people that want it, I'm much more interested in taking this functionality into the core of Peggy.

@mojavelinux
Author

I'm left scratching my head trying to figure out what your last comment is referring to. In case I caused confusion, I wasn't suggesting that my plugin be accepted into the core of Peggy. I was just saying I think it's a cleaner approach as a workaround in the interim.

What I'm requesting is for the Peggy grammar parser to permit Unicode property escapes in a character expression. We know that the grammar already accepts escapes for certain literals such as \n, Unicode escapes like \u00a0, and ranges like a-z. What I'm proposing is to extend that to Unicode property escapes, which are far more powerful and more concise (the Peggy grammar being the case in point).

@hildjj
Contributor

hildjj commented Mar 11, 2023

I understand what you want, and I want it too. In order to use Unicode escapes, you have to have a late enough JS implementation that supports them. That's going to cause us some backward-compatibility work.

@mojavelinux
Author

Cool. Sounds like we're on the same page.

Regarding backward compatibility, what I'm thinking is that if you use them, that's an indication that you want them. I don't think there's any expectation that the parser will work on a version of JS/Node.js that doesn't support them. Trying to put in a shim would be an overreach.

@hildjj
Contributor

hildjj commented Mar 12, 2023

What if it's a warning, unless you're doing output type "es"?

@mojavelinux
Author

That wouldn't be ideal for me since I use Node.js 16/18 with commonjs. The transition to es has been too bumpy in my view and so I stick with the commonjs format.

A possible compromise would be a compliance setting, something akin to what eslint does. That way, there's a mechanism to tell the compiler that it can use/permit certain ECMAScript features. Something like "--compliance-level=es5" or whatever.

Having said that, every modern browser and active Node version supports Unicode property escapes in regular expressions. So I caution against overthinking this.
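For anyone who does need to check, support can be detected at runtime rather than configured. This is only a sketch and not something Peggy provides. The function name `supportsUnicodePropertyEscapes` is made up for this example. It relies on the fact that an engine lacking either the "u" flag or property escapes throws a SyntaxError when asked to construct the pattern.

```javascript
// Hypothetical helper: returns true when the running engine can compile
// a regular expression that uses a Unicode property escape.
function supportsUnicodePropertyEscapes () {
  try {
    // Built from a string so that older engines can still parse this file.
    new RegExp('\\p{Alpha}', 'u')
    return true
  } catch (err) {
    return false
  }
}

console.log(supportsUnicodePropertyEscapes())
```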

@hildjj
Contributor

hildjj commented Mar 12, 2023

Nod, solid argument. Thinking some more.

@reverofevil

I actually prefer the workaround using a plugin

I just wanted to point out that the regular expression language is not regular, and thus cannot be parsed with regular expressions. \p{, \\p{ and \\\p{ have different meanings depending on the number of backslashes, and the only correct way to do that transform is to actually add Unicode property escapes to Peggy's grammar.
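The backslash-counting point can be sketched as follows. The helper `isEscaped` is hypothetical and is not a full regular expression parser. It only shows that whether the `p` is escaped depends on the parity of the backslashes before it, which a one-character lookbehind such as `(?<!\\)` cannot determine.

```javascript
// Hypothetical helper: is the character at `index` preceded by an odd
// number of backslashes (that is, escaped)?
function isEscaped (source, index) {
  let backslashes = 0
  for (let i = index - 1; i >= 0 && source[i] === '\\'; i--) {
    backslashes++
  }
  return backslashes % 2 === 1
}

const one = '\\p{L}' // \p{L}: one backslash, a property escape
const two = '\\\\p{L}' // \\p{L}: an escaped backslash, then a plain "p"
const three = '\\\\\\p{L}' // \\\p{L}: an escaped backslash, then \p{L}

console.log(isEscaped(one, one.indexOf('p'))) // true
console.log(isEscaped(two, two.indexOf('p'))) // false
console.log(isEscaped(three, three.indexOf('p'))) // true

// A single-character lookbehind treats the second case as escaped too.
console.log(/(?<!\\)p\{/.test(two)) // false, although this "p" is unescaped
```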

In order to use Unicode escapes, you have to have a late enough JS implementation that supports them.

This is another case of "it's on the codegen side", and for the reasons already mentioned I'd rather not think too hard about checking this right now, even for Peggy's own JS codegen.

@hildjj hildjj added the enhancement New feature or request label Mar 21, 2023
3 participants