Caching for frequently accessed data files #135
Comments
No, it is not supported unfortunately 😔
This is indeed not yet supported; a filesystem-level cache in DuckDB is really high on my wishlist though. Having DuckDB + a buffer-managed (i.e. disk-offloadable) FS-level cache + delta sounds like a super sweet setup.
Thanks @djouallah and @samansmink, I am glad to hear that it's on the list! It does look like duckdb supports a forward HTTP proxy. Would the delta extension honour the HTTP proxy settings when accessing the blob storage? A wild idea I have is to try putting a forward proxy like squid next to duckdb, give it a large disk cache, and see if it can reduce round-trip times to the blob storage. Does this sound like a workable approach?
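A minimal sketch of what that experiment could look like from Python, assuming a squid instance is already listening on localhost:3128 and using placeholder account/container/table names; whether the azure extension or the delta kernel actually picks up any of these proxy settings is exactly the open question here:

```python
import os
import duckdb

# Assumption: a caching forward proxy (squid) is already running on localhost:3128.
# HTTPS_PROXY covers clients that read proxy settings from the environment.
os.environ["HTTPS_PROXY"] = "http://localhost:3128"

con = duckdb.connect()
con.execute("INSTALL azure; LOAD azure;")
con.execute("INSTALL delta; LOAD delta;")

# Route DuckDB's own HTTP(S) traffic through the proxy as well.
con.execute("SET http_proxy = 'localhost:3128';")

# Placeholder storage account; real credentials would come from the credential chain.
con.execute("""
    CREATE SECRET az_proxy_test (
        TYPE AZURE,
        PROVIDER CREDENTIAL_CHAIN,
        ACCOUNT_NAME 'mystorageaccount'
    );
""")

# Placeholder container/table path.
con.sql("SELECT count(*) FROM delta_scan('az://mycontainer/my_delta_table')").show()
```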
@igor-lobanov-maersk we actually use squid for testing azure, see this and that. I suspect this doesn't work right now on delta though, because I think we'd have to forward the proxy config to the delta kernel, since it does its own IO right now. It would make for an interesting experiment for sure!
Nice, thanks @samansmink. Do you know if there is a way to configure a forward proxy at the delta kernel level? A quick check in the delta kernel repo did not yield anything obviously useful, but this is far from conclusive.
For the record, I raised this as a feature request with the delta kernel team as delta-io/delta-kernel-rs#649. @samansmink could you give any indication whether supporting an FS-offloadable cache for duckdb-delta is on the roadmap?
Hey all - came over here for some context for the kernel-rs issue.
In principle all IO is (or can be) handled by the code referenced in the embedded snippet (lines 177 to 178 at 7f8cc36, not reproduced here).
In this case the object store interaction would be handled by the component referenced in a second embedded snippet (not reproduced here). Turns out, you can just pass in http client config as part of the "regular" configuration options. Not sure how the config is wired through duckdb though :).
I finally got around to doing some experiments with Azure blob storage and the delta extension. Good news: it seems that delta-kernel-rs honours the HTTPS_PROXY environment variable on Linux. In a basic setup with mitmproxy I can see HEAD and GET requests to the table metadata and individual parquet files appearing in the web console. I'll try with squid next.
Having looked at the traces of the GET requests for the data files, I learned that the Azure Storage API client used by the duckdb-delta extension relies on a non-standard header for range reads (screenshot not reproduced here). Interestingly, the GET requests to the metadata checkpoints look different (screenshot not reproduced here). It seems that getting this to work with a caching forward proxy would require some complex request rewriting, and I am no longer sure it is worth getting into unless I can somehow convince the API client to always use the standard header.
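If anyone wants to reproduce the header observation, a small mitmproxy addon along these lines can log which range header each proxied request carries. This assumes the non-standard header in question is Azure's x-ms-range (the Azure Storage REST API's alternative to the standard Range header), which is my guess rather than something stated above:

```python
# log_range_headers.py -- run with: mitmdump -s log_range_headers.py
# Prints which range-related header each proxied request carries, so you can
# compare data-file reads against _delta_log / checkpoint reads.
from mitmproxy import http


def request(flow: http.HTTPFlow) -> None:
    std_range = flow.request.headers.get("range")      # standard header, cacheable by proxies
    ms_range = flow.request.headers.get("x-ms-range")  # Azure-specific header (assumed culprit)
    if std_range or ms_range:
        print(
            f"{flow.request.method} {flow.request.pretty_url} "
            f"Range={std_range!r} x-ms-range={ms_range!r}"
        )
```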
It's on an internal wish-list of features, but I can't give any concrete timeline.
So when querying a Delta Table on Azure using this extension, IO will be performed by two components: the delta kernel does its own IO for the Delta metadata (log and checkpoints), while the parquet data files are read through DuckDB's filesystem layer via the azure extension.
If adding a generic http proxy cache turns out to be complex or to require significant reworks of the azure extension, I would say it's likely not worth it. A cache at the DuckDB filesystem level seems the most natural to me and the way forward here.
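Until such a cache exists inside DuckDB, one partial workaround is to register an fsspec caching filesystem with the DuckDB Python client and point read_parquet at the hot files directly. This is an unverified sketch with placeholder account/container names: it only helps reads that go through DuckDB's registered filesystems (the delta kernel's metadata IO bypasses it), and reading parquet files directly ignores Delta semantics such as removed files, so it is at best a stop-gap for static hot data:

```python
import duckdb
import fsspec

# A whole-file caching layer ("filecache") on top of Azure Blob/ADLS via adlfs.
# Hot blobs are downloaded once into cache_storage and served locally afterwards.
cached_fs = fsspec.filesystem(
    "filecache",
    target_protocol="abfs",
    target_options={"account_name": "mystorageaccount"},  # placeholder account
    cache_storage="/tmp/duckdb_blob_cache",
)

con = duckdb.connect()
con.register_filesystem(cached_fs)

# Placeholder container/path; this reads parquet files directly, *not* via delta_scan.
con.sql(
    "SELECT * FROM read_parquet('filecache://mycontainer/my_delta_table/*.parquet') LIMIT 10"
).show()
```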
I have a scenario where I need to provide a lookup API on top of a Delta Lake table, and I'm considering duckdb straight on top of ADLS. I have a conceptual question regarding the delta scan implementation for which I cannot find any technical details documented, so I would appreciate your input.
Most API calls will be clustered around a small subset of the data, so I'm likely going to have a few 'hot' data files getting most of the traffic. I wonder if duckdb does (or can be configured to) cache recently accessed data files of a Delta Lake table, so that the number of blob reads to ADLS, and with it the likelihood of Azure API request throttling, is reduced?
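Concretely, the scenario described above looks roughly like this sketch (placeholder account, container, table path, and key column):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL azure; LOAD azure;")
con.execute("INSTALL delta; LOAD delta;")

# Placeholder storage account; credentials resolved via the Azure credential chain.
con.execute("""
    CREATE SECRET lookup_api_secret (
        TYPE AZURE,
        PROVIDER CREDENTIAL_CHAIN,
        ACCOUNT_NAME 'mystorageaccount'
    );
""")

def lookup(key: str) -> list[tuple]:
    # Point lookups cluster on a small key range, so the same few parquet files
    # back most requests; today each call still fetches them from ADLS.
    return con.execute(
        "SELECT * FROM delta_scan('az://mycontainer/lookup_table') WHERE id = ?",
        [key],
    ).fetchall()
```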