Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide more comprehensive exhaustiveness queries for Programs and matches. #53

Open
olson-sean-k opened this issue Nov 27, 2023 · 1 comment

Comments

@olson-sean-k
Copy link
Owner

Exhaustiveness describes the extent of a match in a directory tree. A Program (glob) is exhaustive if a match indicates that any path in the sub-tree must also match the pattern. Today, this is provided in the public API by Program::is_exhaustive.

Program::is_exhaustive returns a bool, but this should really be a trivalent logic type. Some patterns have indeterminate exhaustiveness in the absence of a specific candidate path due to alternatives. For example, {**/*.txt,foo/**} is indeterminate: it is nonexhaustive when matching the path bar/baz.txt but is exhaustive when matching the path foo. I plan to introduce a more appropriate type to represent this like the following:

#[derive(Clone, Copy, Debug, Eq, Hash PartialEq)]
pub enum When {
    Never,
    Sometimes,
    Always,
}

let glob = Glob::new(...).unwrap();
if glob.exhaustiveness().is_always() { ... }

Given a candidate path, it should be possible to definitively determine the exhaustiveness of a match, so a bool can be used in that context: the match is either exhaustive or not. This could be provided by Program::matched:

let candidate = CandidatePath::from(...);
let glob = Glob::new(...).unwrap();
if let Some(matched) = glob.matched(&candidate) {
    let text = matched.text(); // Gets the `MatchedText`. This is returned directly today.
    if matched.is_exhaustive() { ... }
}

Unfortunately, there are some big challenges in implementing this.

  • I don't believe there is a way to query a compiled regular expression for which branch of an alternative matched a given input. Barring that, it could be necessary to compile regular expressions for each branch of an alternative along with some way to pair them with components in the matched path. Yikes.
  • Exhaustiveness queries are still inaccurate and need improvement. There are many false negatives today (note that false positives are serious bugs). I've done a lot of experimental work on this outside of this repository, but nothing has landed here yet.
  • Despite the progress I've made on variance and exhaustiveness queries, they still don't support an indeterminate result (When::Sometimes) and I'm not yet sure how to accomplish that.

So why do this? 😄 Well, not only would it provide a more correct API (though I realize that the overwhelming majority of users don't care about these APIs), but it would also remove the need to partition patterns into exhaustive and nonexhaustive patterns in negations. In particular, the FileIterator::not combinator would no longer need to accept a sequence of patterns and could instead accept a single pattern. This could of course be an any combinator, so it would still be possible to use multiple patterns in a single not.

let glob = Glob::new(...).unwrap();
for entry in glob.walk(".").not(Glob::new(...).unwrap()) { ... }

This removes the need for not to handle building programs and emitting errors. More importantly, it also prevents poor performance caused by pathological inputs. Building both nonexhaustive and exhaustive patterns into an any combinator can cause a negation to consider all matches nonexhaustive and therefore read directory trees when it may not be necessary. I also plan to introduce additional combinators that would also benefit from this for the same reasons. For example, a not_except combinator:

let glob = Glob::new(...).unwrap();
for entry in glob
    .walk(".")
    .not_except(
        wax::any([..., ...]).unwrap(), // Negation.
        Glob::new(...).unwrap(), // Override. Allow this pattern despite the negation.
    )
{
    ...
}

I prefer these APIs, but they cannot efficiently filter without knowing about exhaustiveness. Given only a single pattern, these APIs would rely on exhaustiveness information in Program::matched rather than Program::exhaustiveness.

@olson-sean-k
Copy link
Owner Author

This is somewhat of a tangent, but I have a few more thoughts I wanted to write down about this and the changes it could allow.

In the examples from my comment above, not_except is a FileIterator combinator much like not but that also supports overrides. This is similar but distinct from any positive glob that initiates a walk and must be coupled with the filtering combinator, because filtering is monotonic (an override must prevent the filter from moving an entry from filtrate to residue, which is a one-way operation). not_except is a lot like gitignore, which also supports these sorts of overrides.

With better support of exhaustiveness queries and more generally accepting Programs in FileIterator, not_except may not be needed at all and instead an Except glob type can be used when needed. This further generalizes (simplifies) FileIterator combinators and provides a good target for supporting gitignore via a translation from gitignore expressions into an Except combinator under the hood. At time of writing, I plan to introduce a not_git_ignore combinator, but this wouldn't be necessary at all! Instead, Except could be constructed using functions with gitignore support and passed directly to not.

The Program trait would look something like this (some items and bounds omitted):

pub trait Program<'t>: ... {
    fn is_match<'p>(&self, candidate: impl Into<CandidatePath<'p>>) -> bool;
    fn matched<'p>(&self, candidate: &'p CandidatePath<'_>) -> Option<MatchedPath<'p>>;
    // This probably shouldn't be a member of `Program`, but is included here for illustration.
    fn residue<'p>(&self, candidate: &'p CandidatePath<'_>) -> Option<EntryResidue> {
        self.matched(candidate).map(|matched| {
            if matched.is_exhaustive() {
                EntryResidue::Tree
            }
            else {
                EntryResidue::File
            }
        })
    }
    fn exhaustiveness(&self) -> When;
}

impl<'p> MatchedPath<'p> {
    fn text(&self) -> &MatchedText<'p> { ... }
    fn is_exhaustive(&self) -> bool { ... }
}

The FileIterator trait would then look something like this (some items and bounds omitted). Note that not is called not_pattern instead here:

pub trait FileIterator: ... {
    type Entry: ...;
    type Residue: ...;

    fn not_pattern<'t>(self, program: impl Program<'t>) -> Not<'t, Self>; // All that is needed for nominal filters.
    fn not_hidden(self) -> NotHidden<Self>; // A (sometimes) non-nominal filter provided for convenience.
}

With this, the not_except example from my comment above becomes something more like this:

let glob = Glob::new(...).unwrap();
let except = wax::except(
   wax::any([..., ...]).unwrap(), // Negation.
   ..., // Override. Allow this pattern despite the negation.
).unwrap();
for entry in glob.walk(".").not_pattern(except) { ... }

Similarly, gitignore could be read, parsed, and built into an Except:

let glob = Glob::new(...).unwrap();
let except = gitignore::local_from_path(".").unwrap();
for entry in glob.walk(".").not_pattern(except) { ... }

Something more like what the ignore crate provides would look a bit like this:

let path = Path::new(".");
let except = gitignore::local_from_path(path).unwrap();
for entry in path
    .walk()
    .not_hidden()
    .not_pattern(except)
{
    ...
}

I like the shape of these APIs. In essence, it pushes the complexity of composing patterns out of FileIterator combinators and into glob combinators instead. 👍🏽

Lastly, something like this could be done today by always paying the cost of partitioning patterns in any. At time of writing, this is only done in FilterAny, which powers FileIterator::not and allows for much more reliable exhaustiveness queries when it counts. This still means that pathological patterns and constructions are possible (namely that alternatives in a Glob may have worse performance than individual patterns passed to and partitioned by an Any), but it at least enables these APIs at what I think is an acceptable cost (because Any and other glob combinators could implement Program::is_exhaustive more reliably).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant