Implement "cut" (as in Prolog) for committing to a choice #35

Suppose I'm trying to lex this invalid Rust code: `b"\xa"`. The problem here is `\x` needs to be followed by two hex digits, not one.

If I run this with rustc I get an "invalid escape" error, as expected.

If I lex it with [lexgen_rust](https://github.com/osa1/lexgen_rust), I get an identifier `b` first, then an error.

The problem is with backtracking. The lexgen-generated lexer records the successful match for `b` as an identifier and continues lexing, to be able to return the longest match. When it fails to match the rest of the token, it returns `b` as an identifier.

What we want instead is to "commit" to the byte string rule as soon as we see `b"`, i.e. no backtracking past that point. If the rest of the token is not a valid byte string, we fail with an error rather than returning `b` as an identifier.

This is trivial to implement once we come up with a syntax: just reset `last_match` when we take a transition with a "cut" (or "commit") annotation.
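To make the idea concrete, here is a minimal hand-written sketch of a longest-match loop with an optional cut. It is not lexgen's generated code: the `State` machine, `step`, `lex_one`, and the `use_cut` flag are all hypothetical names made up for illustration, and the byte string grammar is heavily simplified. The only part that mirrors the proposal is clearing `last_match` when a transition marked as a cut is taken.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Id(String),
    ByteString(String),
}

#[derive(Debug, PartialEq)]
enum LexError {
    /// Byte offset where the token that failed to lex started.
    InvalidToken { start: usize },
}

// Hypothetical DFA states for identifiers and (simplified) byte strings.
#[derive(Clone, Copy)]
enum State {
    Start,
    B,          // seen `b`: accepting as an identifier
    Id,         // accepting: identifier
    StrBody,    // inside b"..."
    Escape,     // after a backslash
    ExpectHex1, // after `\x`, first hex digit expected
    ExpectHex2, // after `\x` and one hex digit
    StrEnd,     // accepting: closing quote seen
}

/// Returns the next state and whether the transition is annotated as a cut.
/// `None` means the automaton is stuck.
fn step(state: State, c: char) -> Option<(State, bool)> {
    use State::*;
    match (state, c) {
        (Start, 'b') => Some((B, false)),
        (Start, c) if c.is_ascii_alphabetic() || c == '_' => Some((Id, false)),
        (B | Id, c) if c.is_ascii_alphanumeric() || c == '_' => Some((Id, false)),
        // The cut: once `b"` is matched we commit to the byte string rule.
        (B, '"') => Some((StrBody, true)),
        (StrBody, '"') => Some((StrEnd, false)),
        (StrBody, '\\') => Some((Escape, false)),
        (StrBody, _) => Some((StrBody, false)),
        (Escape, 'x') => Some((ExpectHex1, false)),
        (Escape, 'n' | 'r' | 't' | '\\' | '0' | '"' | '\'') => Some((StrBody, false)),
        (ExpectHex1, c) if c.is_ascii_hexdigit() => Some((ExpectHex2, false)),
        (ExpectHex2, c) if c.is_ascii_hexdigit() => Some((StrBody, false)),
        _ => None,
    }
}

/// Lexes the first token of `input`, returning the longest match.
/// With `use_cut` set, cut transitions discard the recorded match.
fn lex_one(input: &str, use_cut: bool) -> Result<Token, LexError> {
    let mut state = State::Start;
    // Longest successful match seen so far, used for backtracking.
    let mut last_match: Option<Token> = None;

    for (i, c) in input.char_indices() {
        let (next, is_cut) = match step(state, c) {
            Some(transition) => transition,
            None => break, // stuck: fall back to `last_match`, if any
        };
        if is_cut && use_cut {
            // Committed: forget the earlier identifier match.
            last_match = None;
        }
        state = next;

        let end = i + c.len_utf8();
        match state {
            State::B | State::Id => last_match = Some(Token::Id(input[..end].to_string())),
            State::StrEnd => last_match = Some(Token::ByteString(input[..end].to_string())),
            _ => {}
        }
    }

    last_match.ok_or(LexError::InvalidToken { start: 0 })
}
```

In this sketch, `lex_one("b\"\\xa\"", false)` backtracks to the identifier `b`, while `lex_one("b\"\\xa\"", true)` fails with an error at offset 0, which is the behavior asked for above.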

Currently the workaround is to have a lexer state for lexing the string body. So instead of this:

```rust
rule Init {
    ...

    "b\"" ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::ByteString(match_)))
    },
}
```

We need something like:

```rust
rule Init {
    ...

    "b\"" => |lexer| lexer.switch(LexerRule::ByteString),
}

rule ByteString {
    ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::ByteString(match_)))
    },
}
```

Since the idea is similar to Prolog's "cut", I suggest a similar syntax:

```rust
rule Init {
    ...

    "b\"" ! ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::ByteString(match_)))
    },
}
```

The `!` above is the "cut" (or "commit"): once `b"` is matched there is no backtracking. We either match the rest of the string according to the current rule, or fail with an error pointing to the character `b`.

I wonder if other lexer generators have a syntax for this kind of thing?

