Limit PhysicalIO from pre-fetching beyond the Row Group boundary when reading Parquet #164

Open
oleg-lvovitch-aws opened this issue Nov 22, 2024 · 0 comments

Tell us more about this new feature.

This idea is courtesy of @ahmarsuhail.
Today PhysicalIO expands the read window based on prior sequential read patterns. This is sensible in general, but when reading Parquet, pre-fetching past the boundary of a row group (RG), or past a footer (which should not trigger prefetching at all), never makes sense.
Since LogicalIO already knows the size of each RG, it knows the upper bound for any prefetch, and enforcing that bound would reduce over-reads.
A strawman approach: PhysicalIO accepts an optional upper bound on each request, and the Parquet LogicalIO passes it on the relevant fetches.
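
To make the strawman concrete, here is a minimal sketch of the clamping logic in Python. It is not the project's code: `plan_read`, `prefetch_window`, and `upper_bound` are hypothetical names chosen for illustration, and the real PhysicalIO API may look quite different. The assumption is that LogicalIO would pass the end offset of the current RG as `upper_bound`, and that the footer would be read with `upper_bound` equal to the end of the requested range (so nothing is prefetched).

```python
from typing import Optional, Tuple


def plan_read(
    start: int,
    length: int,
    prefetch_window: int,
    object_size: int,
    upper_bound: Optional[int] = None,
) -> Tuple[int, int]:
    """Return the half-open byte range [start, end) to fetch.

    Hypothetical helper: `prefetch_window` stands in for whatever extra
    bytes the sequential-read heuristic wants to add, and `upper_bound`
    is the optional limit supplied by the caller (e.g. the RG end offset).
    """
    requested_end = start + length
    # Expand by the prefetch window, but never past the end of the object.
    end = min(requested_end + prefetch_window, object_size)
    if upper_bound is not None:
        # Clamp the prefetch to the caller's bound, but never trim
        # below what the caller explicitly asked to read.
        end = min(end, max(upper_bound, requested_end))
    return start, end


# No bound: the window grows freely, as it does today.
unbounded = plan_read(0, 100, 1000, 10_000)
# Bound at the RG end: the prefetch stops at offset 500.
bounded = plan_read(0, 100, 1000, 10_000, upper_bound=500)
```

Keeping the bound optional means callers that do not understand the file layout keep the current behaviour, while the Parquet path opts in only where it knows the RG extent.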
