- 
                Notifications
    You must be signed in to change notification settings 
- Fork 3.9k
[C++][Parquet] GH-47628: Implement basic parquet file rewriter #47775
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: | 
e4de469    to
    c216849      
    Compare
  
    | @pitrou @adamreeve @mapleFU Do you have any suggestions about this draft? Is there any efficient way to merge two parquet files' schema? | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emm I'm thinking that just reuse the current code a ok way, since these logic in current impl would be a bit hacking with current interface...
| template <typename Builder> | ||
| void SerializeIndex( | ||
| const std::vector<std::vector<std::unique_ptr<Builder>>>& page_index_builders, | ||
| const std::vector<std::vector<std::unique_ptr<Index<Builder>>>>& page_indices, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this separate to different method? This reuse is a bit hacking to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't reviewed all the changes yet and will progressively post my comments.
| /// \brief List of repetition level histograms for each page concatenated together. | ||
| virtual const std::vector<int64_t>& repetition_level_histograms() const = 0; | ||
|  | ||
| virtual const void* to_thrift() const = 0; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is on the critical path. Is it simpler and cleaner to convert ColumnIndex and OffsetIndex to their thrift equivalents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to remove format::OffsetIndex in OffsetIndexImpl in my first commit, and add a new method static inline format::PageLocation ToThrift(const PageLocation&) in thrift_internal.h to convert offset index to its thrift format. But there was a format::ColumnIndex in TypedColumnIndexImpl which is used to avoid copying some vectors.
@pitrou Do you think we should preserve the ColumnIndex::to_thrift method or write a similar static inline format::ColumnIndex ToThrift(const TypedColumnIndexImpl<>&)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general workflow of the rewriter looks good to me. However, I don't believe we should directly manipulate the thrift objects.
| RowGroupRewriter(std::shared_ptr<ArrowInputFile> source, | ||
| std::shared_ptr<ArrowOutputStream> sink, | ||
| const RewriterProperties* props, | ||
| std::shared_ptr<RowGroupReader> row_group_reader, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps introduce a RowGroupContext to hold all row group xxx readers?
|  | ||
| // TODO(HuaHuaY): copy column index and bloom filter instead of reading and | ||
| // writing it. | ||
| auto column_index = page_index_reader_->GetColumnIndex(i); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to check if any of these is nullptr since they are optional.
| auto column_index = page_index_reader_->GetColumnIndex(i); | ||
| auto offset_index = page_index_reader_->GetOffsetIndex(i); | ||
| // TODO(HuaHuaY): support bloom filter writer | ||
| [[maybe_unused]] auto bloom_filter = | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is blocked by #37400
| } | ||
|  | ||
| total_bytes_written += metadata_->total_byte_size(); | ||
| } else { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we need to split this into smaller but dedicated functions?
        
          
                cpp/src/parquet/file_rewriter.cc
              
                Outdated
          
        
      | bool HasData() { | ||
| while (current_rewriter_index_ < rewriters_.size() && | ||
| !rewriters_[current_rewriter_index_]->HasData()) { | ||
| rewriters_[current_rewriter_index_]->Close(); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a debug log to show the progress?
        
          
                cpp/src/parquet/file_rewriter.cc
              
                Outdated
          
        
      | } | ||
| } | ||
|  | ||
| const SchemaDescriptor* schema() const { return rewriters_[0]->schema(); } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still valid once Close has been called on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. rewriters_[0] isn't set to nullptr after Close is called in the current code.
| size_t current_rewriter_index_{}; | ||
| }; | ||
|  | ||
| class ExtendRewriter { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why naming it ExtendRewriter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any suggestion? I just want to distinguish it from ConcatRewriter. If you don't like this name, should we rename it to MergeRewriter?
e037be7    to
    253f281      
    Compare
  
    
This is a draft PR now. I follow Java's implementation but I think it is not a good enough design for C++. Because we must copy lots of code from file_writer.cc or file_reader.cc and it will be troublesome to maintain in the future. I prefer to implement some classes inheriting
XXXWriterorXXXReader. I'll think about how to refactor the code. If anyone has any good suggestions, please comment.Now I have written two kinds of tests. Test the horizontal splicing and vertical splicing of parquet files separately. But only horizontal splicing is implemented now because I don't find an efficient way to merge two parquet files' schema.
Rationale for this change
Allow to rewrite parquet files in binary data formats instead of reading, decoding all values and writing them.
What changes are included in this PR?
ParquetFileRewriterandRewriterProperties.to_thriftandSetXXXmethods to help me copy the metadata.CopyStreammethods to callmemcpybetweenArrowInputStreamandArrowOutputStream.RowGroupMetaDataBuilder::NextColumnChunk(std::unique_ptr<ColumnChunkMetaData> cc_metadata, int64_t shift)which allows to add column metadata without creatingColumnChunkMetaDataBuilder.Are these changes tested?
Yes
Are there any user-facing changes?
ReaderProperties::GetStreamis changed to a const method. Only the signature has been changed. Its original implementation allows it to be declared as a const method.