Suppose I'm trying to lex this invalid Rust code: b"\xa". The problem here is that \x needs to be followed by two hex digits, not one.
If I run this with rustc I get an "invalid escape" error, as expected.
If I run this with lexgen_rust, I get an id b first, then an error.
The problem is with backtracking. The lexgen-generated lexer records the successful match for b as an identifier and continues lexing, to be able to return the longest match. When it fails to match the rest of the token, it returns b as an identifier.
Instead, when we see b" we want to "commit" to the byte string rule, i.e. no backtracking from that point. If the rest of the token is not a valid byte string, we fail with an error rather than returning b as an id.
This is trivial to implement once we come up with a syntax: just reset last_match when we make a transition with a "cut" (or "commit") annotation.
Currently the workaround is to have a lexer state for lexing the string body. So instead of this:
rule Init {
    ...
    "b\"" ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::ByteString(match_)))
    },
}
We need something like:
rule Init {
    ...
    "b\"" => |lexer| lexer.switch(LexerRule::ByteString),
}

rule ByteString {
    ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::ByteString(match_)))
    },
}
Since the idea is similar to Prolog's "cut", I suggest a similar syntax:
rule Init {
    ...
    "b\"" ! ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::ByteString(match_)))
    },
}
That ! above is "cut" (or "commit"), meaning that once b" is matched there is no backtracking: we either match the rest of the string according to the current rule, or fail with an error pointing to the character b.
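To make the intended semantics concrete, here is a minimal hand-written sketch of a longest-match loop that resets last_match on a cut transition. It is not lexgen's generated code: the names Kind, State, step, accepting and lex_one are all made up for illustration, and the byte string grammar is simplified (any ASCII character is allowed in the body, and only a few escapes are recognized).

```rust
// Hypothetical sketch, not lexgen's actual implementation.

#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Id,
    ByteString,
}

#[derive(Clone, Copy)]
enum State {
    Start,
    IdB,  // saw `b`: a valid identifier so far, but may become a byte string
    Id,   // inside an identifier
    Body, // inside a byte string body
    Esc,  // saw `\` in the body
    Hex1, // saw `\x`
    Hex2, // saw `\x` and one hex digit
    Done, // saw the closing `"`
}

/// One transition of the hand-written automaton. The `bool` says whether
/// the transition carries the "cut" annotation.
fn step(state: State, c: char) -> Option<(State, bool)> {
    use State::*;
    let is_id_char = c.is_ascii_alphanumeric() || c == '_';
    match (state, c) {
        (Start, 'b') => Some((IdB, false)),
        (Start, _) if c.is_ascii_alphabetic() || c == '_' => Some((Id, false)),
        // `b"` matched: this is the transition marked with `!`.
        (IdB, '"') => Some((Body, true)),
        (IdB, _) | (Id, _) if is_id_char => Some((Id, false)),
        (Body, '"') => Some((Done, false)),
        (Body, '\\') => Some((Esc, false)),
        (Body, _) if c.is_ascii() => Some((Body, false)),
        (Esc, 'x') => Some((Hex1, false)),
        (Esc, 'n' | 'r' | 't' | '\\' | '0' | '"' | '\'') => Some((Body, false)),
        (Hex1, _) if c.is_ascii_hexdigit() => Some((Hex2, false)),
        (Hex2, _) if c.is_ascii_hexdigit() => Some((Body, false)),
        _ => None,
    }
}

/// Which token kind, if any, a state accepts.
fn accepting(state: State) -> Option<Kind> {
    match state {
        State::IdB | State::Id => Some(Kind::Id),
        State::Done => Some(Kind::ByteString),
        _ => None,
    }
}

/// Lex one token from the start of `input`. On failure, returns the byte
/// offset of the token start (always 0 in this single-token sketch).
fn lex_one(input: &str, honor_cut: bool) -> Result<(Kind, &str), usize> {
    let mut state = State::Start;
    // End offset and kind of the longest match seen so far.
    let mut last_match: Option<(Kind, usize)> = None;
    for (i, c) in input.char_indices() {
        let (next, is_cut) = match step(state, c) {
            Some(transition) => transition,
            None => break, // stuck: fall back to last_match, if any
        };
        if is_cut && honor_cut {
            // Commit: forget earlier matches so we cannot backtrack to them.
            last_match = None;
        }
        state = next;
        if let Some(kind) = accepting(state) {
            last_match = Some((kind, i + c.len_utf8()));
        }
    }
    match last_match {
        Some((kind, end)) => Ok((kind, &input[..end])),
        None => Err(0),
    }
}
```

With honor_cut set to false, lex_one on b"\xa" backtracks and returns the identifier b, which is the current behavior described above. With honor_cut set to true, the same input fails with an error at offset 0, while valid inputs such as b"\xab" or the identifier bar lex as before.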
I wonder if other lexer generators have a syntax for this kind of thing?