
Slow LIST performance with mountpoint #945

Open
bradyforcier1 opened this issue Jul 17, 2024 · 3 comments
Labels
enhancement New feature or request

Comments


bradyforcier1 commented Jul 17, 2024


Background

Testing with the latest version, v1.7.2, I've noticed that LIST performance through Mountpoint is significantly slower than listing the same prefix with other tools.

Test Setup

The test recursively lists a prefix hierarchy containing ~16,000 objects in total; a small script reproducing the comparison is sketched after the timings below.

Mountpoint command: sudo mount-s3 --read-only --allow-other --max-cache-size 50000 --cache /tmp/mtpt_cache --metadata-ttl 300 $BUCKET /tmp/mtpt_test
goofys command: sudo /usr/local/bin/goofys --type-cache-ttl 60s --stat-cache-ttl 60s --file-mode 0555 --dir-mode 0555 -o ro -o allow_other $BUCKET /tmp/goofys_test

  • awscli
    time aws s3 ls --recursive s3://$BUCKET/$PREFIX
    real    0m4.577s
    user    0m3.085s
    sys     0m0.128s
  • mountpoint (caching is enabled, but it seems like LIST responses aren't cached, so subsequent lists are still slow)
    time find /tmp/mtpt_test/$PREFIX -type f
    real    0m38.283s
    user    0m0.011s
    sys     0m0.079s
  • goofys
    time find /tmp/goofys_test/$PREFIX -type f
    real    0m2.333s
    user    0m0.006s
    sys     0m0.053s
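For reference, here is a rough sketch of a script that re-runs the same comparison in one go; BUCKET, PREFIX, and the mount paths are placeholders matching the commands above, so adjust them for your environment:

    #!/usr/bin/env bash
    # Rough sketch: times the three listing methods compared above.
    # BUCKET/PREFIX and the mount paths are placeholders.
    set -euo pipefail

    BUCKET=my-test-bucket
    PREFIX=my/test/prefix

    echo "== awscli recursive list =="
    time aws s3 ls --recursive "s3://${BUCKET}/${PREFIX}" > /dev/null

    echo "== mountpoint (find) =="
    time find "/tmp/mtpt_test/${PREFIX}" -type f > /dev/null

    echo "== goofys (find) =="
    time find "/tmp/goofys_test/${PREFIX}" -type f > /dev/null

Running the Mountpoint find twice also makes it easy to check whether a second, "warm" listing is any faster than the first.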
bradyforcier1 added the enhancement (New feature or request) label on Jul 17, 2024
@monthonk (Contributor)

Hey, I can confirm that Mountpoint doesn't cache any LIST responses today, so every readdir operation goes directly to S3. The metadata cache is mainly used for lookup operations. I didn't expect the result to be this much worse though, since we should only need to do readdir once per directory while traversing the tree.

It would be really helpful to understand the access pattern of the find command, so we will need debug logs from your test. Also, I would like to understand more about the structure of your bucket, like how many levels of subdirectories are under the test prefix. Could you share more information about that?
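If it helps, a mount command along these lines should produce the debug logs; the --debug and --log-directory flags are taken from mount-s3 --help on a recent version, so please double-check them against yours:

    # Sketch only: remount with debug logging, re-run the slow listing,
    # then grab the log files. Verify the flag names with `mount-s3 --help`
    # for your version.
    sudo mkdir -p /tmp/mtpt_logs
    sudo mount-s3 --read-only --allow-other \
        --max-cache-size 50000 --cache /tmp/mtpt_cache --metadata-ttl 300 \
        --debug --log-directory /tmp/mtpt_logs \
        "$BUCKET" /tmp/mtpt_test

    time find /tmp/mtpt_test/$PREFIX -type f > /dev/null
    sudo umount /tmp/mtpt_test
    # Per-request details (e.g. the ListObjectsV2 call issued for each
    # directory) should show up in the files under /tmp/mtpt_logs.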


bradyforcier1 commented Jul 24, 2024

> so we will need debug logs from your test

I would not be comfortable sharing the debug logs, as they will contain the bucket names/paths, which may include sensitive data. But the command is just the standard find utility reporting all files recursively underneath a prefix.

> Also, I would like to understand more about the structure of your bucket, like how many levels of subdirectories are under the test prefix. Could you share more information about that?

In this case, the content of the root prefix we're recursively listing looks like:

    ├── prefix1
    │   ├── a
    │   ├── b
    │   └── prefix1.1
    │       ├── a
    │       ├── b
    │       ├── c
    │       ├── d
    │       ├── e
    │       ├── f
    │       └── g
    ├── prefix2
    ...

There are ~100 top-level prefixes, and each prefix contains ~145 objects spread across its subdirectories. In this test, a total of ~1,900 prefixes were traversed.
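To give a rough sense of why that shape is expensive for a readdir-per-directory traversal, here is a sketch (bucket and prefix names are placeholders) comparing one recursive listing against a sequential, delimiter-based LIST per directory, which is roughly the pattern a FUSE walk produces:

    #!/usr/bin/env bash
    # Sketch: one recursive LIST vs. one delimiter-based LIST per directory.
    # BUCKET/PREFIX are placeholders; PREFIX should end with a slash.
    BUCKET=my-test-bucket
    PREFIX=my/test/prefix/

    # Single recursive, paginated listing (what `aws s3 ls --recursive` does).
    time aws s3 ls --recursive "s3://${BUCKET}/${PREFIX}" > /dev/null

    # One LIST request per directory, issued sequentially.
    list_dir() {
        local prefix="$1"
        aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$prefix" \
            --delimiter '/' --query 'CommonPrefixes[].Prefix' --output text |
            tr '\t' '\n' |
            while read -r sub; do
                case "$sub" in ""|None) continue ;; esac
                list_dir "$sub"
            done
    }
    time list_dir "$PREFIX"

With ~1,900 prefixes, the second form makes roughly 1,900 sequential round trips (each one a separate CLI invocation here, so it overstates the latency, but the shape of the cost is the same), which is consistent with the gap between the awscli and Mountpoint timings above.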

@monthonk (Contributor)

Thanks for sharing the structure. It seems the problem only shows up when there are a lot of prefixes to traverse, since I didn't see the same issue when trying to reproduce it with just a few subdirectories. I will bring this back to the team so we can figure out how to test it and make directory listing more performant.
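For what it's worth, a layout like the one described above could be recreated in a scratch bucket with something along these lines (bucket and prefix names are placeholders; it writes ~100 top-level prefixes with ~1,900 sub-prefixes and ~13,000 empty objects):

    #!/usr/bin/env bash
    # Sketch: populate a scratch bucket with a prefix hierarchy similar to
    # the one described in this issue. Names are placeholders.
    set -euo pipefail
    BUCKET=my-scratch-bucket
    PREFIX=listing-test

    for p in $(seq 1 100); do
        for s in $(seq 1 19); do
            for f in a b c d e f g; do
                # Empty object; only the key matters for LIST performance.
                aws s3api put-object \
                    --bucket "$BUCKET" \
                    --key "${PREFIX}/prefix${p}/sub${s}/${f}" > /dev/null
            done
        done
    done

The loop is naive and sequential, so it takes a while; parallelizing the inner loop (for example with xargs -P) would speed it up considerably.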
