Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit the number of keys to calculate column sizes and page starts in PQ reader to 1B #17059

Open
wants to merge 13 commits into
base: branch-24.12
Choose a base branch
from

Conversation

mhaseeb123
Copy link
Member

@mhaseeb123 mhaseeb123 commented Oct 11, 2024

Description

This PR limits the number of keys to use at a time to calculate column sizes and page_start_values to 1B averting possible OOM and UB from implicit typecasting of size_t iterator to size_type iterators in thrust::reduce_by_key.

Closes #16985
Closes #17086

Resolved

  • Add tests
  • Debug with fingerprinting structs table for a possible bug in PQ writer (nothing seems wrong with the writer as pyarrow is able to read the written parquet files).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mhaseeb123 mhaseeb123 requested a review from a team as a code owner October 11, 2024 00:44
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 11, 2024
@mhaseeb123 mhaseeb123 marked this pull request as draft October 11, 2024 00:44
@mhaseeb123 mhaseeb123 self-assigned this Oct 11, 2024
@mhaseeb123 mhaseeb123 added bug Something isn't working 2 - In Progress Currently a work in progress cuIO cuIO issue non-breaking Non-breaking change labels Oct 11, 2024
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple small comments on type safety but generally looks good.

cpp/src/io/parquet/reader_impl_preprocess.cu Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
@mhaseeb123 mhaseeb123 requested a review from vuule October 15, 2024 02:44
Copy link

copy-pr-bot bot commented Oct 15, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Oct 15, 2024
@mhaseeb123 mhaseeb123 marked this pull request as ready for review October 15, 2024 02:54
@mhaseeb123 mhaseeb123 changed the title 🚧 Limit the number of keys to calculate column sizes and page starts in PQ reader to 1B Limit the number of keys to calculate column sizes and page starts in PQ reader to 1B Oct 15, 2024
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/tests/io/parquet_reader_test.cpp Outdated Show resolved Hide resolved

// Manually create a int64_t `key_start` compatible counting_transform_iterator to avoid
// implicit casting to size_type.
auto const reduction_keys = thrust::make_transform_iterator(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The smoking gun. cudf::detail::make_counting_transform_iterator implicitly typecasts key_start to size_type causing all sorts of problems.

@mhaseeb123 mhaseeb123 added 4 - Needs Review Waiting for reviewer to review or respond and removed 3 - Ready for Review Ready for review by team labels Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: In Progress
4 participants