Supporting buffers that can ignore mark requests #81

ifazk · 2019-07-20T21:53:36Z

It can sometimes be useful for buffers ignore a mark action. For example, a lexbuffer can decide that it would be inappropriate to mark at the current position because it is not a grapheme cluster boundary.

Is this a use case that the maintainers of this project would be willing to support?

ifazk · 2019-07-20T21:54:48Z

On a more technical note, as far as I understand, we would only need to make a small change to call_state in ppx_sedlex.ml. There, instead of returning a best final state, we mark the best final state and immediately backtrack. I'm trying this out in a personal fork, but I need to do a bit more testing before I can send a pull request.

pmetzger · 2019-07-22T18:09:21Z

Let us know. That said, it seems peculiar that a regular expression would mark somewhere that you're not allowed to mark. Marking is below the level of the UI after all.

ifazk · 2019-07-24T07:16:09Z

Here's an example of a situation where a regular expression would mark where you don't want it to mark. Suppose you have the regular expression leg and the utf8 encoded string for "leg̈" (in ocaml syntax that would be "leg\u{0308}"), a specially designed buffer can ignore the mark after the g because it knows that g is not at a character boundary.

pmetzger · 2019-07-26T23:25:06Z

I don't understand your example. Is the issue that you don't want to match leg when you have leg̈?

ifazk · 2019-07-26T23:29:28Z

Yes, that my issue. I don't want a match to be successful in the middle of a user-perceived character. Most people would consider g̈ to be a single character instead of two characters consisting of g and \u{0308}.

pmetzger · 2019-07-26T23:32:06Z

Okay, so that's not "supporting buffers that can ignore mark requests", that's noting a possible bug in the way the regexp engine works. Have you read the Unicode Consortium TR 18 document? I haven't looked at it recently, but what does it claim the correct behavior is supposed to be?

pmetzger · 2019-07-26T23:36:35Z

(A quick scan of TR18 seems to indicate that what you want is an implementation of 2.2.1 Grapheme Cluster Mode but I'm not sure that's precisely what you're asking for.)

ifazk · 2019-07-27T04:35:17Z

What I'm precisely asking for: I want backtracking to be the only way to take a semantic action.

I want sedlex to always mark when it is in a final DFA state, and always backtrack when it can no longer transition to new DFA states. The only case where we don't currently do this is when we are at a final DFA state and there are no transitions to new DFA states. In this case, we directly take a semantic action, instead of marking and then backtracking (see the code that #83 changes).

Always using backtrack to perform semantic actions, allows custom Sedlexing buffers to control the positions at which semantic actions are allowed to take place. In the case where we are at a final DFA state and there are no more transitions out of this state, a custom Sedlexing buffer can decide whether a semantic action should take place at this position, and if not, it backtrack to the last position it would have allowed a semantic action to take place.

ifazk · 2019-07-27T04:36:20Z

My personal use case for this: I have a custom Sedlexing buffer which is aware of Grapheme Cluster boundaries, and only stores marked state at code point positions which are at boundaries. If we try to mark at a position which is not a boundary, it just ignores the Sedlexing.mark.

This does not give us Grapheme Cluster Mode. According to the last paragraph of 2.2.1, in Grapheme Cluster Mode, we should treat /(x\u{308})/ and /(x)(\u{308})/ differently, and /(x)(\u{308})/ should never match. I'm just using sedlex as the regex engine, and it makes no distinction between /(x\u{308})/ and /(x)(\u{308})/.

Supporting any of RL2.2 Extended Grapheme Clusters, RL2.3 Default Word Boundaries or Grapheme Cluster Mode is beyond the scope of sedlex. All of these require the regex engine itself to understand grapheme clusters, words or boundaries, and I don't think it is worth the engineering effort to teach these to sedlex.

I think always using backtracking for semantic actions is a middle ground where using custom Sedlexing buffers we can only allow semantic actions to happen at boundaries.

pmetzger · 2019-07-28T20:17:22Z

Okay, given that what you wan't isn't Grapheme Cluster Mode, I have to confess, and perhaps it's just that I'm thick, that I don't understand what it is that you want to do, even having read the above a few times. Perhaps you could give an example of what the resulting syntax expressing your improvement would be in a real sedlex specification?

ifazk · 2019-07-29T01:14:54Z

Thanks for being so patient with me! I'm going to try to explain myself again.

Suppose I have the following code:

match%sedlex buf with
| "let" -> Token.LET
| "=" -> Token.EQ
| eof -> Token.EOF
| _ -> raise (Token.error @@ Sedlexing.positions buf)

and the input stream "let\u{308}".

For the above, sedlex will not call mark or backtrack at t and instead just directly take a semantic action and return Token.LET.

Here, instead of directly taking a semantic action at t, I want sedlex to call mark and then backtrack and then take the appropriate semantic action according to backtrack.

In my custom buffer, if we try to call mark at the t, my custom buffer won't actually store anything, since t does not occur before a grapheme cluster boundary. This way, when sedlex calls backtrack, I raise an error instead of returning Token.LET.

If we had the input stream "let=" instead, since t occurs at a boundary, my custom buffer will store the state, and when backtrack is called, will return Token.LET.

Instead of having sedlex worry about boundaries, I'm trying to move the burden to buffers. Currently, while a custom buffer can control what mark and backtrack do, it cannot prevent semantic actions from happening at all non-boundary codepoints, because if there are no transitions out of a DFA state, sedlex directly takes a semantic action instead of going through the buffer first.

ifazk mentioned this issue Jul 26, 2019

Giving buffers better control of returned states #83

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Supporting buffers that can ignore mark requests #81

Supporting buffers that can ignore mark requests #81

ifazk commented Jul 20, 2019

ifazk commented Jul 20, 2019

pmetzger commented Jul 22, 2019

ifazk commented Jul 24, 2019

pmetzger commented Jul 26, 2019

ifazk commented Jul 26, 2019

pmetzger commented Jul 26, 2019

pmetzger commented Jul 26, 2019

ifazk commented Jul 27, 2019

ifazk commented Jul 27, 2019 •

edited

Loading

pmetzger commented Jul 28, 2019 •

edited

Loading

ifazk commented Jul 29, 2019

Supporting buffers that can ignore mark requests #81

Supporting buffers that can ignore mark requests #81

Comments

ifazk commented Jul 20, 2019

ifazk commented Jul 20, 2019

pmetzger commented Jul 22, 2019

ifazk commented Jul 24, 2019

pmetzger commented Jul 26, 2019

ifazk commented Jul 26, 2019

pmetzger commented Jul 26, 2019

pmetzger commented Jul 26, 2019

ifazk commented Jul 27, 2019

ifazk commented Jul 27, 2019 • edited Loading

pmetzger commented Jul 28, 2019 • edited Loading

ifazk commented Jul 29, 2019

ifazk commented Jul 27, 2019 •

edited

Loading

pmetzger commented Jul 28, 2019 •

edited

Loading