Feature: Use random access file instead of mmap #571

agoncharuk · 2025-02-24T22:12:40Z

Describe what you are looking for

Currently, the index either loads the whole file from a stream or memory-maps the index file and relies on the OS to page in the relevant pieces. This requires the whole index file to be available locally, which does not work well in data lake environments where files are stored on remote storage.

I was wondering whether the index file structure is suitable for partial loading, so that traversing the index requests only the relevant parts of the file through a user-provided interface (similar to how, for example, only certain row groups and column chunks are fetched during a Parquet file scan)?
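To make the request concrete, here is a minimal sketch of what such a user-provided interface could look like. Everything in it is hypothetical and not part of the existing library: the names `random_access_reader_t`, `memory_reader_t`, and `read_record`, as well as the fixed-size record layout, are assumptions made only for illustration. The real index file layout may be quite different.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical interface (an assumption, not an existing API).
// A caller would implement it on top of remote range reads or a local file.
struct random_access_reader_t {
    virtual ~random_access_reader_t() = default;
    // Total size of the underlying object in bytes.
    virtual std::uint64_t size() const = 0;
    // Copies up to `length` bytes starting at `offset` into `out`.
    // Returns the number of bytes actually copied.
    virtual std::size_t read_at(std::uint64_t offset, void* out, std::size_t length) = 0;
};

// In-memory stand-in for a remote object. It counts requests so that
// the number of fetches issued by a traversal can be inspected.
class memory_reader_t final : public random_access_reader_t {
    std::vector<std::uint8_t> bytes_;
    std::size_t requests_ = 0;

  public:
    explicit memory_reader_t(std::vector<std::uint8_t> bytes) : bytes_(std::move(bytes)) {}

    std::uint64_t size() const override { return bytes_.size(); }

    std::size_t read_at(std::uint64_t offset, void* out, std::size_t length) override {
        ++requests_;
        if (offset >= bytes_.size())
            return 0;
        std::size_t available = bytes_.size() - static_cast<std::size_t>(offset);
        std::size_t count = std::min(length, available);
        if (count != 0)
            std::memcpy(out, bytes_.data() + offset, count);
        return count;
    }

    std::size_t requests() const { return requests_; }
};

// Reads one record by position without touching the rest of the file.
// Assumes a hypothetical layout: a header of `header_size` bytes followed
// by records of exactly `record_size` bytes each.
inline bool read_record(random_access_reader_t& reader, std::uint64_t header_size,
                        std::size_t record_size, std::uint64_t index,
                        std::vector<std::uint8_t>& out) {
    out.resize(record_size);
    std::uint64_t offset = header_size + index * record_size;
    return reader.read_at(offset, out.data(), record_size) == record_size;
}
```

With an interface of this shape, each step of an index traversal would turn into one `read_at` call, so an implementation backed by remote storage could fetch only the byte ranges it needs instead of downloading the whole file.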

Can you contribute to the implementation?

I can contribute

Is your feature request specific to a certain interface?

C++ implementation

Contact Details

No response

Is there an existing issue for this?

I have searched the existing issues

Code of Conduct

I agree to follow this project's Code of Conduct

ashvardanian · 2025-02-24T22:18:29Z

Several teams have already adjusted the core C++ engine to move the storage elsewhere. @agoncharuk, do you have a specific project in mind, or a more specific set of constraints?

agoncharuk · 2025-02-25T07:11:39Z

Yes, I saw both the ClickHouse and DuckDB integrations, which totally make sense!
I am currently investigating a possible Trino integration, specifically in Hive/Iceberg deployments where data resides on S3/HDFS and workers never keep the dataset locally. Fetching the whole index file will certainly work; however, I was wondering whether a partial index read is possible, at least in theory.

agoncharuk added the enhancement (New feature or request) label on Feb 24, 2025
