Merge subpattern references #18

Open: wants to merge 134 commits into base: master

@nbtrap (Contributor) commented Mar 1, 2014

Subpattern references enable the matching of self-similar strings by
way of recursion. Unlike backreferences, which refer to the string
matched by a register, subpattern references refer to the pattern
contained within the register and cause the regex engine to recurse,
as though by an actual function call, to the referenced subpattern.
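
To make the distinction concrete, here is a minimal sketch in Python rather than Lisp. Python's standard re module has no subpattern references, so the reference is only emulated here by textually inlining the group's pattern, which works solely because this reference is not recursive. Both function names are hypothetical and exist only for this illustration.

```python
import re


def backreference_matches(text):
    # \1 must repeat the *string* that group 1 matched.
    return re.fullmatch(r"(a|b)\1", text) is not None


def inlined_reference_matches(text):
    # Emulates (a|b)(?1): the second part re-applies the *pattern*
    # of group 1, so it may match a different string.
    group = r"a|b"
    return re.fullmatch(rf"({group})(?:{group})", text) is not None
```

With these definitions, "ab" is rejected by the backreference version but accepted by the inlined version, because the second position only has to satisfy the pattern a|b, not repeat the first match.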

SYNTAX

A subpattern reference node has the form

(:SUBPATTERN-REFERENCE <ref>)

where <ref> is a positive fixnum denoting a register number or a
string or symbol denoting a register name.

Using the Perl syntax, a subpattern reference looks like

(?N)

or
(?&NAME)

where N is a positive (decimal) integer and NAME is a register name.
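
The correspondence between the two Perl spellings and the node form can be sketched as follows. This is an illustrative Python helper, not the patch's Lisp parser: the function name, the tuple used to stand in for the parse-tree node, and the set of characters allowed in a register name are all assumptions.

```python
import re

# Assumption: register names are ASCII identifiers; the real rule may differ.
_SUBPATTERN_REF = re.compile(
    r"\(\?(?:([1-9][0-9]*)|&([A-Za-z_][A-Za-z0-9_]*))\)"
)


def parse_subpattern_reference(text):
    """Parse (?N) or (?&NAME) at the start of text.

    Returns (node, end_index), where node stands in for
    (:SUBPATTERN-REFERENCE <ref>), or None if text does not start
    with a subpattern reference.
    """
    m = _SUBPATTERN_REF.match(text)
    if m is None:
        return None
    number, name = m.groups()
    ref = int(number) if number is not None else name
    return (":SUBPATTERN-REFERENCE", ref), m.end()
```

A numeric reference yields an integer <ref> and a named reference yields a string, mirroring the fixnum versus string-or-symbol distinction above.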

API CHANGES

There are no API changes.

KNOWN ISSUES

Perl Incompatibilities

The semantics of subpattern references (or "sub calls") in Perl are
not well defined. In particular, as of version 5.19.9, the
interaction between subpattern references and backreferences is
inconsistent. This issue was recently raised on the p5p mailing
list, and the Perl devs seem to be seriously considering adopting
the semantics implemented here. See
https://rt.perl.org/Public/Bug/Display.html?id=121299 for details.

Embedded Modifiers

The interaction between subpattern references and embedded modifiers
(e.g. :CASE-INSENSITIVE-P) is undefined for now and will be
addressed in a future release.

AllegroCL Compatibility Mode

So far as I know, the AllegroCL compatibility mode (enabled by
adding :USE-ACL-REGEXP2-ENGINE to *FEATURES* before compiling) does
not support this feature.

Other Bugs

Several outstanding bugs are known to at least indirectly affect
subpattern references. Cf. #17 and #12, for example.

IMPLEMENTATION DETAILS

During the match phase, the subpattern reference closure calls the
register closure, passing it an extra argument: the match
continuation.

When the register closure sees that it has been called with an extra
argument, it knows that it has been entered via subpattern
reference. At this point, it saves the state of the local
registers' offsets and creates new dynamic "bindings" for them.
Then it calls the register's inner matcher, restoring the register
offsets state upon return therefrom. If the inner matcher has
succeeded, the subpattern reference's continuation is called.
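
The control flow described above can be sketched with hand-built closures. The following Python is an illustration under stated assumptions, not the patch's code: it hard-codes the pattern (a(?:(?1)|)b), anchors the match at position 0 and at the end of the string, and treats a referenced subpattern as atomic (no backtracking into it after it returns), which may differ from the actual implementation. All names are hypothetical.

```python
def compile_anbn(text):
    """Hand-built closures for (a(?:(?1)|)b), matched against all of text.

    Returns a zero-argument function giving register 1's (start, end)
    offsets on success, or None on failure.
    """
    offsets = {1: None}   # register number -> (start, end) or None
    starts = {}           # start position of a directly entered register
    ends = []             # end positions reported by contents-only matchers
    registers = {}        # register number -> register closure

    def literal(ch, next_fn):
        return lambda pos: (pos < len(text) and text[pos] == ch
                            and next_fn(pos + 1))

    def alternation(*branches):
        return lambda pos: any(branch(pos) for branch in branches)

    def subpattern_reference(num, next_fn):
        # The extra argument is the continuation of the reference itself.
        return lambda pos: registers[num](pos, next_fn)

    def contents(next_fn):
        # a (?: (?1) | ) b, followed by next_fn
        tail = literal("b", next_fn)
        return literal("a", alternation(subpattern_reference(1, tail), tail))

    def make_register(num, subregister_count, next_fn):
        def close(end):
            saved = offsets[num]
            offsets[num] = (starts[num], end)
            if next_fn(end):
                return True
            offsets[num] = saved          # undo on backtracking
            return False

        def report_end(end):
            ends.append(end)
            return True

        with_rest = contents(close)            # contents plus what follows
        contents_only = contents(report_end)   # contents alone
        local = range(num, num + subregister_count + 1)

        def register_closure(pos, ref_cont=None):
            if ref_cont is None:               # ordinary entry
                starts[num] = pos
                return with_rest(pos)
            # Entered via a subpattern reference: save the local
            # registers' offsets and give them fresh "bindings".
            saved = {i: offsets[i] for i in local}
            for i in local:
                offsets[i] = None
            matched = contents_only(pos)
            offsets.update(saved)              # restore on return
            return matched and ref_cont(ends.pop())

        return register_closure

    registers[1] = make_register(1, 0, lambda end: end == len(text))

    def run():
        return offsets[1] if registers[1](0) else None

    return run
```

Calling compile_anbn("aabb")() walks the ordinary entry once and the by-reference entry for the nested "ab"; only the ordinary entry writes register 1's offsets, so the result spans the whole string.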

The presence of one or more subpattern references precludes certain
optimizations. However, the performance for existing code (i.e.,
for regular expressions not containing subpattern references) should
be unaffected hereby.

OTHER CHANGES

The testing code has been overhauled. Of note:

1. The Perl script that generates many of the tests has been modified,
among other things, to print results for as many capture groups as
are defined by each regex, but no more.  (The purpose of this was
to support tests involving arbitrarily many capture groups.)
Thus, much of the perltestdata file seems to have changed, but it's
mostly superficial.

2. Perl tests are now run with *ALLOW-NAMED-REGISTERS* bound to T
when the regex contains one or more named registers.

3. Many more tests have been added to *TESTS-TO-SKIP*.  These are
by and large a result of Perl's undefined behavior vis-a-vis
subpattern references.

Be sure to keep track of named subpattern references as well as the
highest numbered subpattern reference encountered.
…ERT.

Also, keep track of which registers have been referenced by number.
… register closure.

This required several things that may not have been necessary and will
have to be revisited.  First of all, for every register, we now create
two inner matchers: one that matches the contents of the register and
what follows the register, and one that only matches the contents of
the register.  Also, we now stop accumulating into STARTS-WITH once we
encounter a register or subpattern reference.

With this patch, subpattern references seem to work for the most part.
They do not yet work with repetitions.
At this point, one thing that doesn't work quite right is the
determination of register offsets for registers accessed indirectly by
subpattern references.  For example:

  (cl-ppcre:scan "(\\([^()]*((?1)\\)|\\)))" "((()))")

says that the second register is at position (3, 6), though it should
be (1, 6).  Fixing this will require binding a special variable from
subpattern reference closures that tells register closures not to
touch the register offsets.
…ctly from a subpattern reference.

With this patch, the following invocation:
  (cl-ppcre:scan "(\\([^()]*((?1)\\)|\\)))" "((()))")
gives the correct offset values for the second register as (1,6).

One problem that remains is the danger of infinite recursion during
backtracking.  The following invocation:
  (cl-ppcre:scan "(?1)(?2)(a|b|(?1))(c)" "acba")

causes a stack overflow because the second (?1) is called endlessly
during backtracking without the match position advancing through the
string.  Such behavior might be remedied by having the
subpattern reference's closure keep track of where in *STRING* it has
been called before.
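
The remedy floated above can be sketched as follows. This Python illustration uses a simpler, hypothetical left-recursive pattern, ((?1)a|b), rather than the pattern from the commit message, and the function name and the guarded flag are assumptions. The guard records the positions at which the reference is currently active and fails when it is re-entered at one of them without the match position having advanced.

```python
def left_recursive_match(text, guarded):
    """Try ((?1)a|b) at position 0; True if some prefix of text matches."""
    active = set()   # positions at which the reference is currently active

    def literal(ch, cont):
        return lambda pos: (pos < len(text) and text[pos] == ch
                            and cont(pos + 1))

    def reference(pos, cont):
        if guarded:
            if pos in active:
                return False      # re-entered here without advancing
            active.add(pos)
        try:
            return register(pos, cont)
        finally:
            if guarded:
                active.discard(pos)

    def register(pos, cont):
        # Contents of register 1: (?1)a | b
        return reference(pos, literal("a", cont)) or literal("b", cont)(pos)

    return register(0, lambda end: True)
```

With guarded set to False the first branch calls itself at position 0 forever and Python raises RecursionError; with guarded set to True the re-entry fails and the engine falls through to the second branch. Note that this crude guard also rejects legitimate re-entry after an empty match, so it is a sketch of the idea rather than a complete remedy.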
This is going to be reverted immediately, since apparently Perl isn't
smart enough to do this and will itself overflow the stack.
1634 and 1635 currently don't work.
…trings in patterns containing subpattern references.
Currently, the following tests fail: 1638, 1639, 1641, 1642, 1643, 1644,
1645, 1646.

@nbtrap (Contributor, Author) commented Mar 1, 2014

Tested on SBCL, ECL, and CLISP.

The documentation says that subpattern refs were added in version 2.1.0, so you might want to change that if you don't bump the version number like that.

I wrote the pull request as plain text (ignoring GitHub's "markdown") so it could be used as the merge's commit message.

;; only push the register states for this register and registers
;; local to it
(loop for idx from num upto (+ num subregister-count) do
(let ()

Remove this gratuitous let.

@hanshuebner (Member)

My review does not constitute a willingness to merge the change, which is up to Edi to decide.

@nbtrap (Contributor, Author) commented Mar 1, 2014

Suggested changes have been made.

This should be a SPECIAL declaration, not a type declaration.

@gefjon commented Jan 10, 2022

This PR is beyond my ability to review. It's also quite old. Is anyone still interested?

@stassats (Member)

Yeah, that looks tricky. Don't worry about leaving old PRs; maybe someday it'll be useful. A closed PR will never find the light.

@gefjon added the "stale" label (PRs that have languished and will require considerable updates before considering merging) Jan 11, 2022