Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid destination directory creation when using glob double asterisk(**) #1321

Closed
john-jam opened this issue Jul 31, 2023 · 9 comments · Fixed by #1329
Closed

Avoid destination directory creation when using glob double asterisk(**) #1321

john-jam opened this issue Jul 31, 2023 · 9 comments · Fixed by #1329

Comments

@john-jam
Copy link
Contributor

john-jam commented Jul 31, 2023

I would like to avoid creating the last folder of the source path at the destination when glob double asterisk(**) to select files with specific extensions are used. When using them in the copy method (also happens on get or put) on the LocalFileSystem (and other remote fs implementation as well), a destination folder (the last folder of the source path) is created in several unwanted situations.

Example

Code:

from fsspec.implementations.local import LocalFileSystem

fs = LocalFileSystem(auto_mkdir=True)

# List source files
print(f"Source files: {fs.glob('/tmp/data/**.txt')}")

# Copy to a location with trailing slash
fs.copy(path1="/tmp/data/**.txt", path2="/tmp/out/")
print(f"Test1: {fs.glob('/tmp/out/**.txt')}")
fs.rm(path="/tmp/out", recursive=True)

# Copy to a location without trailing slash
fs.copy(path1="/tmp/data/**.txt", path2="/tmp/out")
print(f"Test2: {fs.glob('/tmp/out/**.txt')}")
fs.rm(path="/tmp/out", recursive=True)

# Copy twice to a location without trailing slash
fs.copy(path1="/tmp/data/**.txt", path2="/tmp/out")
fs.copy(path1="/tmp/data/**.txt", path2="/tmp/out")
print(f"Test3: {fs.glob('/tmp/out/**.txt')}")
fs.rm(path="/tmp/out", recursive=True)

Output:

Source files: ['/tmp/data/a/b/f3.txt', '/tmp/data/a/f2.txt', '/tmp/data/f1.txt']
Test1: ['/tmp/out/data/a/b/f3.txt', '/tmp/out/data/a/f2.txt', '/tmp/out/data/f1.txt']
Test2: ['/tmp/out/a/b/f3.txt', '/tmp/out/a/f2.txt', '/tmp/out/f1.txt']
Test3: ['/tmp/out/a/b/f3.txt', '/tmp/out/a/f2.txt', '/tmp/out/data/a/b/f3.txt', '/tmp/out/data/a/f2.txt', '/tmp/out/data/f1.txt', '/tmp/out/f1.txt']

When a trailing slash is added to the destination (Test1), the folder data is created. When the trailing slash is not added and we copy only once (Test2), the folder data is not created (expected results). But when we do twice the copy operation, the folder data is created once again.

Expect: ['/tmp/out/a/b/f3.txt', '/tmp/out/a/f2.txt', '/tmp/out/f1.txt'] even when we call copy twice.

@martindurant
Copy link
Member

Can you please check against https://filesystem-spec.readthedocs.io/en/latest/copying.html and see if your cases violate any of the stated desired behaviours?

I would note that it's always possible to go your own way, by performing the glob first and then a list comprehension to make the set of paths you wish to copy to, since copy() allows for list arguments for source and destination.

Is this the same as #1315 ?

@john-jam
Copy link
Contributor Author

john-jam commented Jul 31, 2023

@martindurant Thank you for you response!

I checked the expected behaviors you mentioned but I can't find a use case with the double asterisk (and with or without file extension suffix).
But I think the closest expected behavior would be the 1h. Glob to new directory (recursive). In the example, when copying from source/subdir/*, the subdir folder is not created under the destination target/newdir/ and the files like subfile1 is right below newdir. Doing the same call but with a double asterisk and with recursive=False should act similarly as this example and not create the subdir folder (see first example below).

After some digging, it seems related to the exists option of the other_paths utils method and its value definition in the copy method for example. When forcing it to False, the last folder is not removed from the common prefix and then replaced by the right path2 value.
This method also impact the behavior when we use double asterisk without file extensions (with or without a trailing slash).

Here are some examples:

from fsspec.implementations.memory import MemoryFileSystem

fs = MemoryFileSystem()

fs.touch(path="/test/data/a.txt")
fs.touch(path="/test/data/b/c.txt")
fs.touch(path="/test/data/d/e/f.txt")

# Double asterisk without trailing slashes and without file extension suffix
fs.copy(path1="/test/data/**", path2="/test/out/")
# Output: ['/test/out/data/a.txt', '/test/out/data/b/c.txt', '/test/out/data/d/e/f.txt'] (data folder not expected)

# Double asterisk with trailing slashes and without file extension suffix
fs.copy(path1="/test/data/**/", path2="/test/out/")
# Output: ['/test/out/a.txt', '/test/out/b/c.txt', '/test/out/d/e/f.txt'] (expected)

# Double asterisk without trailing slashes and with file extension suffix
fs.copy(path1="/test/data/**.txt", path2="/test/out/")
# Output: ['/test/out/data/a.txt', '/test/out/data/b/c.txt', '/test/out/data/d/e/f.txt'] (data folder not expected)

# Double asterisk with trailing slashes and with file extension suffix
fs.copy(path1="/test/data/**/*.txt", path2="/test/out/")
# Output: ['/test/out/data/b/c.txt', '/test/out/data/d/e/f.txt'] (data folder not expected and file a.txt missing, possibly related to the issue you linked)

# Double asterisk without trailing slashes, with file extension suffix and with `exists` in `other_paths` forced to False
fs.copy(path1="/test/data/**.txt", path2="/test/out/")
# Output: ['/test/out/a.txt', '/test/out/b/c.txt', '/test/out/d/e/f.txt'] (expected)

Do you know which use case requires to have exists to True (can't think of one but I'm really new to fsspec)? If forcing this value to False is not a suitable solution, maybe an update in the trailing_sep_maybe_asterisk method could support the double asterisk?
Also, the issue #1315 you mentioned appears when calling the expand_path method so before calling the other_paths method. Both issues seems unrelated.

I would be glad to contribute if this is something you would like to support!

@ianthomas23
Copy link
Collaborator

@john-jam Thanks for looking at this. I think adding support for "**" in trailing_sep_maybe_asterisk is a good approach.

For testing, I would be inclined to modify test_copy_glob_to_existing_directory to change

for target_slash in [False, True]:
    ...
    fs.cp(fs_join(source, "subdir", "*"), t)

to something like

from itertools import product
for glob, target_slash in product(["*", "**"], [False, True]):
    ....
    fs.cp(fs_join(source, "subdir", glob), t)

and then run the test suite as normal to see what happens.

If you'd like to try something like this then that would be great. If you get into trouble then I'll happily help out.

@john-jam
Copy link
Contributor Author

john-jam commented Aug 4, 2023

@ianthomas23 @martindurant I am currently preparing a PR to fix those behaviors and I have something working currently but I have some question for you to clarify the expectations with double asterisks **.

What are the relations between recursive and globs in paths?

I believe that double asterisks ** would behave similarly as when we pass recursive=True but what about when we use a single asterisk * and recursive?

To clarify, here are currently identical calls. Could you confirm that those calls are meant to be identical?:

# 1
fs.copy(path1="/source/subdir/", target="/target/newdir")
fs.copy(path1="/source/subdir/*", target="/target/newdir")

# 2
fs.copy(path1="/source/subdir/", target="/target/newdir", recursive=True)
fs.copy(path1="/source/subdir/*", target="/target/newdir", recursive=True)

# 3
fs.copy(path1="/source/subdir/", target="/target/newdir", recursive=True)
fs.copy(path1="/source/subdir/**", target="/target/newdir")

# 4
fs.copy(path1="/source/subdir/*", target="/target/newdir", recursive=True)
fs.copy(path1="/source/subdir/**", target="/target/newdir")

Are single asterisk * and double asterisks ** allowed at other places than the last one in the paths during copy, get, put?

In a lot of test case in this repo, we can see some examples with those chars at the very end of paths like .../*, .../**, .../*/ , .../**/ or , .../abc*.
But should we support having them in the middle of paths like the following?:

# 5
fs.copy(path1="/source/*/file*", target="/target/newdir")

# 6
fs.copy(path1="/source/**/*file*", target="/target/newdir")

Should maxdepth be supported when using double asterisks ** with glob?

Since a glob call is done in the expand_path method, do we want to support maxdepth when we use double asterisks ** and if yes, what would be the behavior of the following?:

# 7
fs.glob(path="/source/subdir/**", maxdepth=1)

# 8
fs.glob(path="/source/**/*file*", maxdepth=1)

# 9
# Here, the question is on which double asterisk `**` we should apply the `maxdepth`
fs.glob(path="/source/**/nesteddir/**", maxdepth=1)

I have something working for 7 and 8 here, but this does not work for 9 yet.

@martindurant
Copy link
Member

relationship between glob and recursive

You could imagine it as a two-step process, that first you get a list of paths from the glob, and then all of the files within each of those when recursive.

Are single asterisk * and double asterisks ** allowed at other places than the last one

Yes

Should maxdepth be supported when using double asterisks **

We very likely haven't tested this case. Glob with ** works by listing all paths within the highest common root. For "/path/part/**/more/st*uff" this would be "/path/part". Maxdepth becomes the number of dir levels past that point which are allowed; so in this example, the minimum maxdepth that returns anything is 2, e.g., some file like "/path/path/more/st88uff". I think any other meaning for maxdepth is ambiguous as you point out.

@john-jam
Copy link
Contributor Author

john-jam commented Aug 4, 2023

You could imagine it as a two-step process, that first you get a list of paths from the glob, and then all of the files within each of those when recursive.

I see. So when using double asterisks ** in globs, the 2 steps process makes naturally the recursive option ignored right? If recursive is False, then the first step already return paths recursively with the double asterisk and when recursive is True, we already retrieve all the files at first step so no need to go deeper.

Are single asterisk * and double asterisks ** allowed at other places than the last one
Yes

I'll add some test cases for this one because I can see some unexpected behaviors when having suffixes.

so in this example, the minimum maxdepth that returns anything is 2

I don't get why it would be 2 in your example? Is it 2 because you forgot to mention that you select maxdepth=2 in the glob call or is it 2 because there are 2 levels after the double asterisks (more/st*uff)? I would expect to have paths like /path/part/morepath/morepath/more/st88uff as well when using your glob example and without maxdepth option passed. And when using maxdepth=2 in your example, I would expect 0 results because the st*uff part is at depth 3. But it could be at depth 2 if this issue is indeed an issue and double asterisks with trailing separator **/ should match 0+ directories. In that eventuality only, we would have results like /path/path/more/st88uff. What do you think?

On a side note, applying only the maxdepth on the first double asterisks ** makes sense.

@martindurant
Copy link
Member

with the double asterisk and when recursive is True, we already retrieve all the files at first step so no need to go deeper.

No, this is not necessarily true if ** isn't the end of your input path

it could be at depth 2 if this #1315 is indeed an issue and double asterisks with trailing separator **/

Exactly, I think it should match zero or more. It should be checked against posix docs that this is the actual convention.

@john-jam
Copy link
Contributor Author

john-jam commented Aug 5, 2023

No, this is not necessarily true if ** isn't the end of your input path

Oh right, I was talking about the case where ** or **/ are at the end but forgot the case like /path/part/**/subpath where we should go deeper if recursive=True, gotcha.

Exactly, I think it should match zero or more. It should be checked against posix docs that this is the actual convention.

Okay then this needs to be supported as well.

@john-jam
Copy link
Contributor Author

john-jam commented Aug 6, 2023

@martindurant After some digging online, it looks like posix does not support double asterisks ** by default: https://man7.org/linux/man-pages/man7/glob.7.html
There are, however, some supports in bash and python which somehow extend the posix rules.

I created a draft PR for now, with what I believe, a better support for the "extended-posix" rules. The added tests run several paths against the glob implementation in python, the bash stat command with globstar enabled and fsspec (local implementation) with my fixes.

Some questions/comments originated from my tests and I would like your thoughts before I'll try supporting maxdepth and fixing the copy method unwanted directory creation:

  • the root dir should be returned when using double asterisks **. For example, it should return an empty string among the other paths when using ** or return ab among the other paths when using ab/**. I added a fix in the find method so fsspec also returns the root dir if it exists but this makes other tests fail.
  • when using a trailing slash, it should return only directories, I made a fix to return only directories when there's a trailing slash initially in the path.
  • The double asterisks ** indeed should match 0+ directories so the regex pattern had to be updated.
  • glob in python accepts a recursive option and only handle ** if True. Do we want the same behavior for fsspec glob and accept a recursive option? This would clarify the case /path/part/**/more/stuff when glob is used from copy for example.
  • glob in python and stat in bash, both do not support double asterisks with other prefix or suffix than a slash or an empty string. So for example, **.json is the same as *.json and not the same as **/*.json. But since it's already supported in fsspec, I think this could be great to keep it. The tests done to only fsspec implementation are here to check if those non-posix use cases still work with my updates.

Please let me know if there are some tests I forgot! I'll add tests for maxdepth and copy soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants