Bazillion of ListBucket issued #938
I realise, writing this, that my situation should improve by raising the …
Well, in fact. Setting … I’m not an expert on how AWS charges … Unless there is a bug in your usage of …
Hey @fredDJSonos,
Yes, if your workload can tolerate stale entries, or it's even expected that the bucket content won't change, we'd recommend picking the longest reasonable TTL. If you never expect the content to change during the workload, you can use `--metadata-ttl indefinite`.
Thanks for sharing the suggestion. It's something we've considered. Unfortunately, the FUSE lookup request does not include the purpose of the request, so there's no way to know whether the application wants a file or a directory at that path. This means that once we tell the kernel that some path component is a directory, it will treat it like a directory from that point on without consulting Mountpoint. It's also a challenge faced in #891, where we want to allow access to directories within a bucket without having access to the paths at the root.
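For illustration only (this is not Mountpoint's code), here is a minimal sketch of what a FUSE lookup callback actually receives, using the fuser crate that Mountpoint builds on; `SketchFs` is a made-up name:

```rust
use std::ffi::OsStr;

use fuser::{Filesystem, ReplyEntry, Request};

struct SketchFs;

impl Filesystem for SketchFs {
    // The kernel only passes the parent inode and the name being resolved.
    // Nothing in the arguments says whether the caller intends to open() a
    // file, read a directory, or merely stat() the path, so the filesystem
    // has to decide "file or directory?" on its own before replying.
    fn lookup(&mut self, _req: &Request<'_>, parent: u64, name: &OsStr, reply: ReplyEntry) {
        let _ = (parent, name);
        // Placeholder: a real implementation resolves the name (e.g. against S3)
        // and calls reply.entry(...) with the attributes it decided on.
        reply.error(libc::ENOENT);
    }
}
```

Whatever the lookup answers is what the kernel remembers for that name, which is why the decision can't be deferred to a later open or readdir.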
It does depend on the key space. If your workload can tolerate stale entries, or it's even expected that the bucket content won't change, we'd recommend picking the longest reasonable TTL. It will ensure that repeated lookups can be served from the cache and don't need to go to S3. Without metadata caching, the number of requests for opening a path grows with the number of path components, since each component needs its own lookup. If it's possible, performing a list of the directory before opening the files can help here, as it will perform one listing of the prefix, which allows all the children to be cached.
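To make that concrete, here is a rough way to express it (my sketch of the behaviour described above, not an official formula), where $d$ is the number of intermediate directories in the path, so $d + 1$ is the number of path components to look up:

$$
\text{requests per cold open} \approx (d + 1)\,(\text{HeadObject} + \text{ListObjectsV2}) + \text{GetObject}
$$

With a warm metadata cache the first term drops away, which is why both the long TTL and the up-front directory listing help.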
ListObjectsV2 (referenced as ListBucket in billing) does cost more than object-level requests. The pricing is available for your region on the pricing page under "Requests & data retrievals": https://aws.amazon.com/s3/pricing/
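For a rough sense of scale, using the published S3 Standard prices in us-east-1 at the time of writing (LIST at 0.005 USD per 1,000 requests, GET at 0.0004 USD per 1,000; check the pricing page for current figures):

$$
\frac{0.005}{0.0004} = 12.5
$$

So each ListObjectsV2 costs roughly 12.5 times a GetObject, which is consistent with lookup traffic dominating the bill for a workload that otherwise only reads objects.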
It's not possible to avoid the traversal, although we certainly wish the protocol could support it. We actually implement a small amount of caching (1 second) even when caching is turned off, to avoid immediately making calls for the same directory again (details). The best option, if you can, is to extend the metadata TTL to as long a duration as works for your workload. Ultimately, I'd make the following recommendations:
…
Thanks for your answer. Just to be clear, our last experiment was with mountOptions:
```yaml
- allow-other
- region us-east-1
- cache /tmp # specify cache directory, relative to root host filesystem
- metadata-ttl indefinite # https://github.com/awslabs/mountpoint-s3/blob/main/doc/CONFIGURATION.md#metadata-cache
- max-cache-size 512 # 512MB maximum cache size
- max-threads 64 # increasing max-threads
```
I guess you're talking about the lookup handler you have to provide to FUSE. Then it would also solve #891. In the end this gives a weird filesystem where all the possible directories appear to exist. But since directories don’t really exist in S3, that’s OK.
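For what it's worth, a hedged sketch of that idea on top of the fuser crate (not Mountpoint's actual implementation; `OptimisticFs`, the fixed inode and the attribute values are all placeholders): lookup always answers "this is a directory" without calling S3, and only open/readdir would actually issue GetObject/ListObjectsV2.

```rust
use std::ffi::OsStr;
use std::time::{Duration, SystemTime};

use fuser::{FileAttr, FileType, Filesystem, ReplyEntry, Request};

struct OptimisticFs;

// Synthetic attributes for a directory we have never checked on S3.
fn synthetic_dir_attr(ino: u64) -> FileAttr {
    let now = SystemTime::now();
    FileAttr {
        ino,
        size: 0,
        blocks: 0,
        atime: now,
        mtime: now,
        ctime: now,
        crtime: now,
        kind: FileType::Directory,
        perm: 0o755,
        nlink: 2,
        uid: 0,
        gid: 0,
        rdev: 0,
        blksize: 4096,
        flags: 0,
    }
}

impl Filesystem for OptimisticFs {
    // Always claim the name exists and is a directory, without calling S3.
    // Only open()/read()/readdir() would talk to S3 (GetObject/ListObjectsV2)
    // and report an error at that point if the key turns out not to exist.
    fn lookup(&mut self, _req: &Request<'_>, _parent: u64, _name: &OsStr, reply: ReplyEntry) {
        let ino = 2; // placeholder: a real implementation would allocate and track inodes per path
        let ttl = Duration::from_secs(1);
        reply.entry(&ttl, &synthetic_dir_attr(ino), 0);
    }
}
```

The catch is the one described in the reply above: once the kernel has been told a name is a directory, a plain open() of that same name as a file needs extra handling, which is exactly the trade-off being discussed.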
In case there is a problem when the kernel does a …
That still works for #891.
Mountpoint for Amazon S3 version
mount-s3 1.7.0
AWS Region
us-east-1
Describe the running environment
Running inside an EKS cluster with mountpoint-s3-csi-driver.
Mountpoint options
What happened?
Our business code essentially opens a file at a given path to read its content. It might `stat` a given path, but no directory listing whatsoever happens. If we were to use S3 directly, we would just call `GetObject`, and nothing else.
We investigated using `mountpoint-s3` and discovered that the dominant cost (from Cost Explorer) is the `ListBucket` action.
For historical reasons, we have a folder structure inherited from a real FS. It looks like this:
We have 150 million files distributed in this folder structure.
I’m aware of this issue #770.
I wonder if you could propose an implementation that does no lookup for intermediate folders. You could pretend to FUSE that all possible directory paths exist, without checking that on S3. When there is a syscall to get a file or list the contents of a dir, then and only then would you call S3.
Relevant log output
No response