Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI][Python] test_get_file_info_with_selector is failed on AppVeyor #37555

Closed
kou opened this issue Sep 5, 2023 · 5 comments · Fixed by #37558
Closed

[CI][Python] test_get_file_info_with_selector is failed on AppVeyor #37555

kou opened this issue Sep 5, 2023 · 5 comments · Fixed by #37558

Comments

@kou
Copy link
Member

kou commented Sep 5, 2023

Describe the bug, including details regarding any error messages, version, and platform.

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/47954752

_ test_get_file_info_with_selector[PyFileSystem(FSSpecHandler(fsspec.LocalFileSystem()))] _
fs = <pyarrow._fs.PyFileSystem object at 0x00000230E8CD5720>
pathfn = <function py_fsspec_localfs.<locals>.<lambda> at 0x00000230E4A47490>
    def test_get_file_info_with_selector(fs, pathfn):
        base_dir = pathfn('selector-dir/')
        file_a = pathfn('selector-dir/test_file_a')
        file_b = pathfn('selector-dir/test_file_b')
        dir_a = pathfn('selector-dir/test_dir_a')
        file_c = pathfn('selector-dir/test_dir_a/test_file_c')
        dir_b = pathfn('selector-dir/test_dir_b')
    
        try:
            fs.create_dir(base_dir)
            with fs.open_output_stream(file_a):
                pass
            with fs.open_output_stream(file_b):
                pass
            fs.create_dir(dir_a)
            with fs.open_output_stream(file_c):
                pass
            fs.create_dir(dir_b)
    
            # recursive selector
            selector = FileSelector(base_dir, allow_not_found=False,
                                    recursive=True)
            assert selector.base_dir == base_dir
    
            infos = fs.get_file_info(selector)
            if fs.type_name == "py::fsspec+s3":
                # s3fs only lists directories if they are not empty, but depending
                # on the s3fs/fsspec version combo, it includes the base_dir
                # (https://github.com/dask/s3fs/issues/393)
                assert (len(infos) == 4) or (len(infos) == 5)
            else:
>               assert len(infos) == 5
E               AssertionError: assert 6 == 5
E                +  where 6 = len([<FileInfo for 'C:/Users/appveyor/AppData/Local/Temp/1/pytest-of-appveyor/pytest-0/test_get_file_info_with_select8/sel...p/1/pytest-of-appveyor/pytest-0/test_get_file_info_with_select8/selector-dir/test_file_b': type=FileType.File, size=0>])
pyarrow\tests\test_fs.py:707: AssertionError
_ test_get_file_info_with_selector[PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))] _
fs = <pyarrow._fs.PyFileSystem object at 0x00000230E5148100>
pathfn = <function py_fsspec_memoryfs.<locals>.<lambda> at 0x00000230E4A47750>
    def test_get_file_info_with_selector(fs, pathfn):
        base_dir = pathfn('selector-dir/')
        file_a = pathfn('selector-dir/test_file_a')
        file_b = pathfn('selector-dir/test_file_b')
        dir_a = pathfn('selector-dir/test_dir_a')
        file_c = pathfn('selector-dir/test_dir_a/test_file_c')
        dir_b = pathfn('selector-dir/test_dir_b')
    
        try:
            fs.create_dir(base_dir)
            with fs.open_output_stream(file_a):
                pass
            with fs.open_output_stream(file_b):
                pass
            fs.create_dir(dir_a)
            with fs.open_output_stream(file_c):
                pass
            fs.create_dir(dir_b)
    
            # recursive selector
            selector = FileSelector(base_dir, allow_not_found=False,
                                    recursive=True)
            assert selector.base_dir == base_dir
    
            infos = fs.get_file_info(selector)
            if fs.type_name == "py::fsspec+s3":
                # s3fs only lists directories if they are not empty, but depending
                # on the s3fs/fsspec version combo, it includes the base_dir
                # (https://github.com/dask/s3fs/issues/393)
                assert (len(infos) == 4) or (len(infos) == 5)
            else:
>               assert len(infos) == 5
E               AssertionError: assert 6 == 5
E                +  where 6 = len([<FileInfo for '/selector-dir': type=FileType.Directory>, <FileInfo for '/selector-dir/test_dir_a': type=FileType.Dire...-dir/test_file_a': type=FileType.File, size=0>, <FileInfo for '/selector-dir/test_file_b': type=FileType.File, size=0>])
pyarrow\tests\test_fs.py:707: AssertionError
_ test_get_file_info_with_selector[PyFileSystem(FSSpecHandler(s3fs.S3FileSystem()))] _
fs = <pyarrow._fs.PyFileSystem object at 0x00000230E4971EA0>
pathfn = <method-wrapper '__add__' of str object at 0x00000230DA38F9A0>
    def test_get_file_info_with_selector(fs, pathfn):
        base_dir = pathfn('selector-dir/')
        file_a = pathfn('selector-dir/test_file_a')
        file_b = pathfn('selector-dir/test_file_b')
        dir_a = pathfn('selector-dir/test_dir_a')
        file_c = pathfn('selector-dir/test_dir_a/test_file_c')
        dir_b = pathfn('selector-dir/test_dir_b')
    
        try:
            fs.create_dir(base_dir)
            with fs.open_output_stream(file_a):
                pass
            with fs.open_output_stream(file_b):
                pass
            fs.create_dir(dir_a)
            with fs.open_output_stream(file_c):
                pass
            fs.create_dir(dir_b)
    
            # recursive selector
            selector = FileSelector(base_dir, allow_not_found=False,
                                    recursive=True)
            assert selector.base_dir == base_dir
    
            infos = fs.get_file_info(selector)
            if fs.type_name == "py::fsspec+s3":
                # s3fs only lists directories if they are not empty, but depending
                # on the s3fs/fsspec version combo, it includes the base_dir
                # (https://github.com/dask/s3fs/issues/393)
                assert (len(infos) == 4) or (len(infos) == 5)
            else:
                assert len(infos) == 5
    
            for info in infos:
                if (info.path.endswith(file_a) or info.path.endswith(file_b) or
                        info.path.endswith(file_c)):
                    assert info.type == FileType.File
                elif (info.path.rstrip("/").endswith(dir_a) or
                      info.path.rstrip("/").endswith(dir_b)):
                    assert info.type == FileType.Directory
                elif (fs.type_name == "py::fsspec+s3" and
                      info.path.rstrip("/").endswith("selector-dir")):
                    # s3fs can include base dir, see above
                    assert info.type == FileType.Directory
                else:
                    raise ValueError('unexpected path {}'.format(info.path))
                check_mtime_or_absent(info)
    
            # non-recursive selector -> not selecting the nested file_c
            selector = FileSelector(base_dir, recursive=False)
    
            infos = fs.get_file_info(selector)
            if fs.type_name == "py::fsspec+s3":
                # s3fs only lists directories if they are not empty
                # + for s3fs 0.5.2 all directories are dropped because of buggy
                # side-effect of previous find() call
                # (https://github.com/dask/s3fs/issues/410)
>               assert (len(infos) == 3) or (len(infos) == 2)
E               AssertionError: assert (4 == 3 or 4 == 2)
E                +  where 4 = len([<FileInfo for 'pyarrow-filesystem/selector-dir': type=FileType.Directory>, <FileInfo for 'pyarrow-filesystem/selector... type=FileType.File, size=0>, <FileInfo for 'pyarrow-filesystem/selector-dir/test_file_b': type=FileType.File, size=0>])
E                +  and   4 = len([<FileInfo for 'pyarrow-filesystem/selector-dir': type=FileType.Directory>, <FileInfo for 'pyarrow-filesystem/selector... type=FileType.File, size=0>, <FileInfo for 'pyarrow-filesystem/selector-dir/test_file_b': type=FileType.File, size=0>])
pyarrow\tests\test_fs.py:733: AssertionError

Note that the last succeeded build uses "fsspec 2023.6.0": https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/47946265?fullLog=true#L460
And the failed build uses "fsspec 2023.9.0": https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/47954752?fullLog=true#L461

Component(s)

Continuous Integration, Python

@AlenkaF
Copy link
Member

AlenkaF commented Sep 5, 2023

I can confirm the failure locally (MacOS) with fsspec and s3fs version 2023.9.0. Trying to find the change in the fsspec that caused this.

@AlenkaF
Copy link
Member

AlenkaF commented Sep 5, 2023

I have pinpointed the PR that caused the change: fsspec/filesystem_spec#1329.

@AlenkaF
Copy link
Member

AlenkaF commented Sep 5, 2023

And I can confirm that the differences in the tests are due to the root dir also being returned with the new fsspec version. From the fsspec PR description from above:

New rules:

  • the root dir is returned when using find. It is because of the posix ** rule compliance. For example, it should return an empty string among the other paths when using ** or return ab among the other paths when using ab/**.

As the documentation about get_file_info doesn't state anything about the root dir being included or not and because the PyFileSystem is based on fsspec I think we should only update the tests to reflect the change in the fsspec library vs. changing the get_file_info() function to keep the previous behaviour (ignoring root dir).
@danepitkin @raulcd what do you think?

Will prepare a PR to update the tests to start with.

@danepitkin
Copy link
Member

Your analysis makes sense to me! +1

@h-vetinari
Copy link
Contributor

Seeing this all over the arrow feedstock. Thanks for tracking this down!

AlenkaF added a commit that referenced this issue Sep 14, 2023
…tory (#37558)

### Rationale for this change

There has been some changes in the way fsspec lists the directories with new version 2023.9.0, see fsspec/filesystem_spec#1329, which caused our tests to start failing.

### What changes are included in this PR?

This PR updates the `get_file_info_selector` in [FSSpecHandler](https://arrow.apache.org/docs/_modules/pyarrow/fs.html#FSSpecHandler) class to keep the behaviour of our spec.

### Are there any user-facing changes?

No.

* Closes: #37555

Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
@AlenkaF AlenkaF added this to the 14.0.0 milestone Sep 14, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
… directory (apache#37558)

### Rationale for this change

There has been some changes in the way fsspec lists the directories with new version 2023.9.0, see fsspec/filesystem_spec#1329, which caused our tests to start failing.

### What changes are included in this PR?

This PR updates the `get_file_info_selector` in [FSSpecHandler](https://arrow.apache.org/docs/_modules/pyarrow/fs.html#FSSpecHandler) class to keep the behaviour of our spec.

### Are there any user-facing changes?

No.

* Closes: apache#37555

Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
… directory (apache#37558)

### Rationale for this change

There has been some changes in the way fsspec lists the directories with new version 2023.9.0, see fsspec/filesystem_spec#1329, which caused our tests to start failing.

### What changes are included in this PR?

This PR updates the `get_file_info_selector` in [FSSpecHandler](https://arrow.apache.org/docs/_modules/pyarrow/fs.html#FSSpecHandler) class to keep the behaviour of our spec.

### Are there any user-facing changes?

No.

* Closes: apache#37555

Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants