Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37470: [Python][Parquet] Add missing arguments to ParquetFileWriteOptions #37469

Merged
merged 5 commits into from
Sep 20, 2023

Conversation

judahrand
Copy link
Contributor

@judahrand judahrand commented Aug 30, 2023

Rationale for this change

I think this may have been missed when this feature was added.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@github-actions github-actions bot added the awaiting review Awaiting review label Aug 30, 2023
@mapleFU
Copy link
Member

mapleFU commented Aug 30, 2023

Would you mind create a Github issue, and rename the title to:

GH-{issue-id}: [Python][Parquet] ...

@judahrand judahrand changed the title Allow write_page_index to be set for Datasets GH-37470: Allow write_page_index to be set for Datasets Aug 30, 2023
@github-actions
Copy link

⚠️ GitHub issue #37470 has been automatically assigned in GitHub to PR creator.

@judahrand judahrand changed the title GH-37470: Allow write_page_index to be set for Datasets GH-37470: [Python][Parquet] Allow write_page_index to be set for Datasets Aug 30, 2023
@judahrand judahrand changed the title GH-37470: [Python][Parquet] Allow write_page_index to be set for Datasets GH-37470: [Python][Parquet] Add missing arguments to ParquetFileWriteOptions Aug 30, 2023
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind add a test for write_page_index? Since we need to ensure page index would be written.

Besides, disable statistics for column can force only write offset-index, maybe we can just test them

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Sep 2, 2023
@judahrand
Copy link
Contributor Author

judahrand commented Sep 8, 2023

Would you mind add a test for write_page_index? Since we need to ensure page index would be written.

I've added a test for this.

@mapleFU
Copy link
Member

mapleFU commented Sep 8, 2023

Great! I'll take a careful round later. By the way we can disable statistics to just disable Column Index but write offset index. Maybe this could be add to doc later.

@judahrand
Copy link
Contributor Author

judahrand commented Sep 8, 2023

Great! I'll take a careful round later. By the way we can disable statistics to just disable Column Index but write offset index. Maybe this could be add to doc later.

Are you saying write_statistics=False, wrtie_page_index=True will write the OffsetIndex but not the ColumnIndex? If that's the case then I'd absolutely agree the docs should be updated as currently they suggest that write_statistics is ignored if write_page_index=True! But this should probably be a separate PR as the docs will need updating in pyarrow.parquet too?

@judahrand
Copy link
Contributor Author

Great! I'll take a careful round later. By the way we can disable statistics to just disable Column Index but write offset index. Maybe this could be add to doc later.

Are you saying write_statistics=False, wrtie_page_index=True will write the OffsetIndex but not the ColumnIndex? If that's the case then I'd absolutely agree the docs should be updated as currently they suggest that write_statistics is ignored if write_page_index=True! But this should probably be a separate PR as the docs will need updating in pyarrow.parquet too?

That is indeed what you were saying as the test shows! That behaviour strikes me as quite unintuitive!

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is ok here, however, I think we should export the flag to parquet/core.py, and from that to _dataset_parquet.pyx. You can take a look at your another patch ( #37665 ). Since just a _dataset_parquet config is hard to use for users.

@mapleFU
Copy link
Member

mapleFU commented Sep 11, 2023

That is indeed what you were saying as the test shows! That behaviour strikes me as quite unintuitive!

Cause I'm the author of that behavior lol ( #35958 )

When disable collect statistics, currently it would be hard to collect the ColumnIndex, because ColumnIndex relies on Statistics.

Also, by the way, by default, if column index is enabled, the page header statistics will not be written. (Since spec says if column index exists, page header is not tent to be written)

@judahrand
Copy link
Contributor Author

When disable collect statistics, currently it would be hard to collect the ColumnIndex, because ColumnIndex relies on Statistics.

Yeah, that makes sense! Would be good to clarify in the docs - will see if I get to it in another PR.

Also, by the way, by default, if column index is enabled, the page header statistics will not be written. (Since spec says if column index exists, page header is not tent to be written)

Yeah, what gets written and what doesn't is quite confusing. In fact the spec doesn't techincally say that the page-level statistics when writing the ColumnIndex but that it isn't recommended (one might want both in order to support old readers).
https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach

It's probably sensible default behaviour but it'd be nice to force being able to write both.

@judahrand
Copy link
Contributor Author

judahrand commented Sep 11, 2023

The logic is ok here, however, I think we should export the flag to parquet/core.py, and from that to _dataset_parquet.pyx. You can take a look at your another patch ( #37665 ). Since just a _dataset_parquet config is hard to use for users.

The flags are already there. They were just never added to the ParquetFileWriteOptions:

write_page_index=False,

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for late reply. Finally I got this. User can write the data with these options in ParquetWriter python sdk, but cannot use them with ParquetFileWriteOptions.

This is LGTM.

@jorisvandenbossche @AlenkaF Would you mind take a look

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@AlenkaF
Copy link
Member

AlenkaF commented Sep 15, 2023

One thought: would it be worth adding docstrings for ParquetFileWriteOptions and adding examples of use to the Python user guide? As a separate issue?

@mapleFU
Copy link
Member

mapleFU commented Sep 20, 2023

Any updates? Should we rerun continuous-integration/appveyor/pr and checkin if it's a unrelated flacky test?

@AlenkaF
Copy link
Member

AlenkaF commented Sep 20, 2023

Any updates? Should we rerun continuous-integration/appveyor/pr and checkin if it's a unrelated flacky test?

The failing AppVeyor test would be fixed with rebasing the branch (#37555).

It would be good to see if something can be done regarding the documentation (see) before merging.

@judahrand
Copy link
Contributor Author

Any updates? Should we rerun continuous-integration/appveyor/pr and checkin if it's a unrelated flacky test?

The failing AppVeyor test would be fixed with rebasing the branch (#37555).

Thanks, I've done this now.

It would be good to see if something can be done regarding the documentation (see) before merging.

I can try to make time to get to this but it feels like a pretty independent issue - especially as the WriteOptions classes don't even appear in the Python docs: https://arrow.apache.org/docs/python/api/dataset.html#classes. It feels like it would make more sense to add documentation for IPC, CSV, ORC, and Parquet FileWriteOptions all in one PR?

@AlenkaF
Copy link
Member

AlenkaF commented Sep 20, 2023

I can try to make time to get to this but it feels like a pretty independent issue - especially as the WriteOptions classes don't even appear in the Python docs: https://arrow.apache.org/docs/python/api/dataset.html#classes. It feels like it would make more sense to add documentation for IPC, CSV, ORC, and Parquet FileWriteOptions all in one PR?

Agree totally, it deserves a special issue for sure! What you propose sounds great, thank you!
It would be enough for now just to open a separate issue and link it from here, so we do not forget.

@judahrand
Copy link
Contributor Author

Agree totally, it deserves a special issue for sure! What you propose sounds great, thank you! It would be enough for now just to open a separate issue and link it from here, so we do not forget.

#37800

@mapleFU
Copy link
Member

mapleFU commented Sep 20, 2023

All test passed now, can we try to checkin this patch? :-)

@AlenkaF
Copy link
Member

AlenkaF commented Sep 20, 2023

Great, will merge 👍

@AlenkaF AlenkaF merged commit f52ebbb into apache:main Sep 20, 2023
12 checks passed
@AlenkaF AlenkaF removed the awaiting committer review Awaiting committer review label Sep 20, 2023
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit f52ebbb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…leWriteOptions` (apache#37469)

### Rationale for this change

I think this may have been missed when this feature was added.

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#37470

Authored-by: Judah Rand <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
AlenkaF pushed a commit that referenced this pull request Dec 20, 2023
### Rationale for this change

Picking up where #35453 left off.

Closes #35331

This PR builds on top of #37469 

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: #35331

Lead-authored-by: Judah Rand <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
### Rationale for this change

Picking up where apache#35453 left off.

Closes apache#35331

This PR builds on top of apache#37469 

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#35331

Lead-authored-by: Judah Rand <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…leWriteOptions` (apache#37469)

### Rationale for this change

I think this may have been missed when this feature was added.

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#37470

Authored-by: Judah Rand <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
### Rationale for this change

Picking up where apache#35453 left off.

Closes apache#35331

This PR builds on top of apache#37469 

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#35331

Lead-authored-by: Judah Rand <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python][Parquet] write_page_index to be set for Datasets
3 participants