Feature: Use random access file instead of mmap #571

Open
3 tasks done
agoncharuk opened this issue Feb 24, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@agoncharuk

Describe what you are looking for

Currently, the index either loads the whole file from a stream or mmaps the index file and relies on the OS to fetch the relevant pieces. Both approaches require the whole index file to be available locally, which does not work well in data lake environments where files live on remote storage.

I was wondering whether the index file structure is suitable for partial loading, so that traversing the index requests only the relevant parts of the file through a user-provided interface (similar to how, for example, specific row groups and column chunks are fetched during a Parquet file scan).
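
To make the request more concrete, here is a minimal sketch of what such a user-provided interface might look like. Every name in it (`random_access_source_t`, `memory_source_t`, `read_record`) is hypothetical and not part of the existing API; the in-memory source only stands in for remote storage so the example is runnable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical interface (an assumption, not the library's API):
// the index would ask for byte ranges instead of dereferencing an mmap.
struct random_access_source_t {
    virtual ~random_access_source_t() = default;
    // Total size of the underlying file in bytes.
    virtual std::uint64_t size() const = 0;
    // Copies `length` bytes starting at `offset` into `out`.
    virtual void read_at(std::uint64_t offset, void* out, std::size_t length) = 0;
};

// In-memory stand-in for a remote object; counts how many reads were issued.
class memory_source_t final : public random_access_source_t {
    std::vector<std::uint8_t> bytes_;
    std::size_t reads_ = 0;

  public:
    explicit memory_source_t(std::vector<std::uint8_t> bytes) : bytes_(std::move(bytes)) {}

    std::uint64_t size() const override { return bytes_.size(); }

    void read_at(std::uint64_t offset, void* out, std::size_t length) override {
        assert(out != nullptr || length == 0);
        if (offset > bytes_.size() || length > bytes_.size() - offset)
            throw std::out_of_range("read past the end of the source");
        std::memcpy(out, bytes_.data() + offset, length);
        ++reads_;
    }

    std::size_t reads() const { return reads_; }
};

// Reads one fixed-size, trivially copyable record at a known offset.
template <typename record_at>
record_at read_record(random_access_source_t& source, std::uint64_t offset) {
    static_assert(std::is_trivially_copyable<record_at>::value,
                  "records must be trivially copyable");
    record_at record;
    source.read_at(offset, &record, sizeof(record_at));
    return record;
}
```

With something along these lines, a traversal step would call `read_record` (or `read_at` directly) for each piece of the file it needs, and the integration would decide how those calls map to remote requests.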

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

C++ implementation

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@agoncharuk agoncharuk added the enhancement New feature or request label Feb 24, 2025
@ashvardanian
Contributor

Several teams have successfully adapted the core C++ engine to move the storage elsewhere. @agoncharuk, do you have a specific project in mind, or a more specific set of constraints?

@agoncharuk
Author

Yes, I saw both the ClickHouse and DuckDB integrations, which make a lot of sense!
I am currently investigating a possible Trino integration, specifically for Hive/Iceberg deployments where data resides on S3/HDFS and workers never keep the dataset locally. Fetching the whole index file will certainly work; however, I was wondering whether a partial index read is possible, at least in theory.
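
As a rough illustration of the storage side of a partial read, here is a sketch of a reader that fetches fixed-size blocks on demand and caches them, so repeated reads of nearby offsets do not trigger new remote requests. It says nothing about whether the index layout makes this efficient, which is the open question. The names `range_fetch_t` and `block_cached_reader_t` are hypothetical, the fetch callback stands in for something like an S3 or HDFS ranged read, and the cache is unbounded for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical callback: returns bytes [offset, offset + length) of the remote file.
using range_fetch_t =
    std::function<std::vector<std::uint8_t>(std::uint64_t offset, std::size_t length)>;

class block_cached_reader_t {
    range_fetch_t fetch_;
    std::uint64_t file_size_;
    std::size_t block_size_;
    std::unordered_map<std::uint64_t, std::vector<std::uint8_t>> blocks_;
    std::size_t fetches_ = 0;

    // Returns the cached block, fetching it from remote storage on first use.
    const std::vector<std::uint8_t>& block(std::uint64_t index) {
        auto found = blocks_.find(index);
        if (found != blocks_.end()) return found->second;
        std::uint64_t begin = index * block_size_;
        std::size_t length = static_cast<std::size_t>(
            std::min<std::uint64_t>(block_size_, file_size_ - begin));
        ++fetches_;
        return blocks_.emplace(index, fetch_(begin, length)).first->second;
    }

  public:
    block_cached_reader_t(range_fetch_t fetch, std::uint64_t file_size, std::size_t block_size)
        : fetch_(std::move(fetch)), file_size_(file_size), block_size_(block_size) {
        if (block_size_ == 0) throw std::invalid_argument("block size must be positive");
    }

    // Copies `length` bytes at `offset` into `out`, touching only the blocks it spans.
    void read_at(std::uint64_t offset, void* out, std::size_t length) {
        assert(out != nullptr || length == 0);
        if (offset > file_size_ || length > file_size_ - offset)
            throw std::out_of_range("read past the end of the file");
        auto* destination = static_cast<std::uint8_t*>(out);
        while (length > 0) {
            std::uint64_t index = offset / block_size_;
            std::size_t within = static_cast<std::size_t>(offset % block_size_);
            const std::vector<std::uint8_t>& data = block(index);
            if (data.size() <= within) throw std::runtime_error("short read from storage");
            std::size_t chunk = std::min(length, data.size() - within);
            std::memcpy(destination, data.data() + within, chunk);
            destination += chunk;
            offset += chunk;
            length -= chunk;
        }
    }

    // Number of remote fetches issued so far.
    std::size_t fetches() const { return fetches_; }
};
```

In this sketch the cost of a query is the number of distinct blocks it touches, so how well it works would depend on how the index places related data within the file.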
