Skip to content

Collection of enhancements and fixes#9

Open
csnover wants to merge 44 commits intoreloginn:mainfrom
csnover:main
Open

Collection of enhancements and fixes#9
csnover wants to merge 44 commits intoreloginn:mainfrom
csnover:main

Conversation

@csnover
Copy link

@csnover csnover commented Sep 1, 2025

Hello,

This PR contains the previous three PRs that I opened separately, plus additional patches. Changes include:

  • Adding repository to Cargo.toml (Add repository to Cargo.toml #6)
  • Replacing use of Rust strings with byte arrays, since Lua strings are not UTF-8 (Replace Rust strings with byte arrays #7)
  • Add support of the inverted character classes (Implement inverted character class #8)
  • Enabling and fixing pedantic clippy lints
  • Using a typed Error API with token position information
  • Additional convenience functions internally to make parsing code a bit clearer
  • Remove all inefficient boxing, reference counting, collecting, cloning, etc. in lieu of references with appropriate lifetimes
  • Replace use of HashMap in the API with a simple callback function so the table type is agnostic to the consumer
  • Implement position capture groups (this got smeared across a couple of commits, apologies for the mess)
  • Restructure the public API to organise public modules by function, hide internals, remove excess Options
  • Document the public APIs
  • Add continuous integration using basic GH workflow

The edition and MSRV are bumped to use the stabilised if-and-while-let chain syntax. This isn’t strictly necessary but it makes the code cleaner in areas and I don’t know why not to do this.

The error handling is not quite as good as it could be, but probably more than good enough, given the scope of the library. In particular, there are redundant variants for unexpected tokens because I ran out of interest in continuing to work on it. syn’s Lookahead1 would be a good way to collect all possible expected tokens into something that can then be collected into an error.

I would recommend also to convert the internal engine tests (in src/engine/tests.rs) to ones that check the public APIs.

I suppose this should make the library more or less 1.0-ready, since the API is well-defined and limited, but I only bumped the pre-release number due to the many API-breaking changes.

OK, that‘s it from me for now. I will be able to answer questions or maybe make minor changes but don’t anticipate working on this any more from now (unless I run into some other bug).

Thanks!

csnover added 15 commits August 31, 2025 14:58
Lua strings are not UTF-8 and so it is incorrect to use Rust
strings. In particular, some Lua libraries specifically designed
to add support for Unicode, like ustring, use `string.find` to look
for non-ASCII characters in Lua strings, which is impossible to do
when Lua strings are treated as UTF-8 (a character set of
`[\x80-\xff]` is ill-formed).

This is an API-breaking change.

Note that downstream consumers like piccolo will now convert using
`IntoValue for Vec<T>` instead of `IntoValue for StdString` and
so those need to be changed to call `ctx.intern` explicitly instead
of relying on the auto-conversion.
PIL 20.2: An upper case version of any of those classes represents
the complement of the class.
Because this changes the signature of an exported function to
remove an unnecessary `Result` and a struct to eliminate an
unnecessary `Box`, this patch contains API-breaking changes.
This makes the replacement method agnostic to whatever type the
consumer is using instead of requiring the table to be collected
into a `HashMap`.

This is an API-breaking change.
Lua 5.4 § 6.4.1:

> As a special case, the capture () captures the current string
> position (a number). For instance, if we apply the pattern
> "()aa()" on the string "flaaap", there will be two captures: 3
> and 5.
@reloginn
Copy link
Owner

reloginn commented Sep 2, 2025

Hello! I will definitely review everything in the next two days and provide comments. Thank you so much!

This is necessary for interleaving VM function calls where a
function callback cannot complete synchronously, without forcing
the normal API to be async.
@csnover
Copy link
Author

csnover commented Sep 2, 2025

Thanks! After everything, it was also necessary to add interruptible gsub so I pushed a change for that. At the same time I noticed a couple places where I forgot to add Eq markers. I’m definitely, probably, done now, pending any feedback.

I also have completed working implementation of these methods for the piccolo VM stdlib now; if you want me to put those somewhere or to send something upstream to supersede your PR, let me know.

Closely associated types can be stored in the same module. This
reduces the amount of re-exporting that has to happen to get
everything out, reduces file switching during development, and
may decrease the chance of creating confusing type name conflicts.
Due to current non-conformance in the engine, this causes tests to
fail.
It is not possible for the lexer to know the nesting level of sets
since the pattern `[]]` is valid.

There were bad tests in the lsonar test suite which did not
conform to PUC-Lua which have been changed for conformance.
This represents the null character and is tested in the PUC-Lua
test suite.
6.4.1:

> %x: (where x is any non-alphanumeric character)
6.4.1:

> The beginning and the end of the subject are handled as if they
> were the character '\0'.
6.4.1:

> A pattern item can be […] a single character class followed by
> '*' […]

Captures, balances, and frontiers are not character classes.
The previous design, while beautiful, was inefficient and not
capable of correctly handling non-greedy matching inside a tree
node. An example of this is with the common pattern for trimming
strings, `^%s*(.-)%s*$`.

A non-greedy match needs to take one token per iteration until
the *entire pattern* fails to parse, but this is not possible when
only a subtree is visible to the repetition function. Trying to
deal with this by naïve linearisation means that the edge of the
capture is lost.

I imagine that there is some technique that would allow this to
work (I am no parser expert) but the thing about Lua patterns is
that they are way too simple for this level of complexity in the
first place. So, this commit deletes the old engine and replaces it
with a hand-translation of the PUC-Lua pattern matching code which
works and passes the Lua test suite.
@csnover
Copy link
Author

csnover commented Sep 5, 2025

Hi again. This PR has totally changed because somehow it took me way too long to see that this library as written was not actually matching according to actual Lua language? I’m very confused about what source was used to decide what Lua patterns look like. A lot of time was wasted.

For posterity, I retained the historical attempts to change the old design to actually conform, but it was impossible due to the broken non-greedy backtracking. As an academic point I would enjoy knowing whether there is some way to actually make the old design not be broken in this way without making it even more inefficient, but in real life this new one will work better. Since it does not create an IR, it is substantially more efficient. It could be touched up to be more Rust-y, but I’ve spent too much time on this already.

Best,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants