[BUG] DataFrame.to_arrow is inconsistent with pa.Table.from_pandas() when preserve_index=True #14159

Closed
rjzamora opened this issue Sep 21, 2023 · 0 comments · Fixed by #15373
Labels: 0 - Backlog (In queue waiting for assignment), bug (Something isn't working), Python (Affects Python cuDF API)

Comments

@rjzamora
Member

Describe the bug
It is my understanding that we want the DataFrame.to_arrow API to be consistent with pa.Table.from_pandas() (when possible). This is currently not the case when DataFrame.index is a RangeIndex, and preserve_index=True is specified. In this case, pa.Table.from_pandas() will use the RangeIndex information to produce an explicit "__index_level_0__" column in the output pyarrow Table.

Side Note: Creating an explicit column is the only way to generate a Table schema that will "safely" preserve the index in dask, because that schema may be used to "rebuild" partitions with a different number of rows later on (and so the original start/stop metadata can be "wrong").
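
To illustrate the side note, here is a minimal pure-Python sketch of why range metadata alone can go stale once partitions are rebuilt. The dictionary layout mirrors the "kind": "range" entry that appears in the pandas schema metadata, but the helper names, the 4-row partition, and its labels are hypothetical and exist only for this example.

```python
# Hypothetical range-index metadata captured from a 10-row partition.
range_meta = {"kind": "range", "name": None, "start": 0, "stop": 10, "step": 1}


def labels_from_range_metadata(meta):
    """Rebuild index labels using only the stored start/stop/step."""
    return list(range(meta["start"], meta["stop"], meta["step"]))


def metadata_matches(meta, num_rows):
    """Check whether the stored range still describes a partition of num_rows rows."""
    return len(labels_from_range_metadata(meta)) == num_rows


# The original partition has 10 rows, so the metadata is accurate.
original_rows = ["cat", "dog"] * 5
original_ok = metadata_matches(range_meta, len(original_rows))

# After a shuffle, assume only 4 rows land in this partition (labels 1, 3, 5, 7).
# The schema is reused, so the stored start/stop no longer describe the data.
rebuilt_rows = ["dog", "dog", "dog", "dog"]
rebuilt_explicit_index = [1, 3, 5, 7]
rebuilt_ok = metadata_matches(range_meta, len(rebuilt_rows))

print(original_ok, rebuilt_ok)
```

An explicit index column travels with the rows themselves, so it stays correct no matter how the rows are regrouped, which is the property the shuffle relies on.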

Steps/Code to reproduce bug

In [1]: import pandas as pd, cudf
In [2]: import pyarrow as pa
In [3]: df = cudf.DataFrame({"x": ["cat", "dog"] * 5})
In [4]: pdf = df.to_pandas()

The schema extracted from pandas will have an explicit "__index_level_0__" column when preserve_index=True, and will store only RangeIndex metadata (with no index column) when preserve_index=None:

In [6]: pa.Table.from_pandas(pdf, preserve_index=None).schema
Out[6]: 
x: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 362
In [7]: pa.Table.from_pandas(pdf, preserve_index=True).schema
Out[7]: 
x: string
__index_level_0__: int64
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 431

The schema extracted from cudf will never have an explicit index column, and will only store RangeIndex metadata if preserve_index=True.

In [8]: df.to_arrow(preserve_index=None).schema
Out[8]: 
x: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 177
In [9]: df.to_arrow(preserve_index=True).schema
Out[9]: 
x: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 362
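
The three shapes of "index_columns" seen in the outputs above (an empty list, a range descriptor, and a column name) can be told apart programmatically. The sketch below uses only the standard library; `describe_index` is a hypothetical helper, and the sample metadata strings are small hand-written stand-ins rather than the full (truncated) values printed above.

```python
import json


def describe_index(pandas_metadata_json):
    """Classify each entry of "index_columns" in a pandas metadata blob."""
    meta = json.loads(pandas_metadata_json)
    described = []
    for entry in meta["index_columns"]:
        if isinstance(entry, str):
            # The index is stored as a real column with this field name.
            described.append(("column", entry))
        elif isinstance(entry, dict) and entry.get("kind") == "range":
            # Only start/stop/step are stored; no data column exists.
            described.append(("range", (entry["start"], entry["stop"], entry["step"])))
        else:
            described.append(("unknown", entry))
    return described


# Hand-written stand-ins for the three cases (not the real, longer metadata).
no_index = json.dumps({"index_columns": []})
range_only = json.dumps(
    {"index_columns": [{"kind": "range", "name": None, "start": 0, "stop": 10, "step": 1}]}
)
explicit = json.dumps({"index_columns": ["__index_level_0__"]})

for blob in (no_index, range_only, explicit):
    print(describe_index(blob))
```

With a real pyarrow schema, the blob would come from `schema.metadata[b"pandas"]`; `json.loads` accepts that bytes value directly.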

Expected behavior
I'd like for DataFrame.to_arrow(preserve_index=True) to be consistent with pa.Table.from_pandas(..., preserve_index=True). More specifically, I'd like cudf to produce an explicit column in the pyarrow Table. When the original RangeIndex is un-named, it probably makes sense to call the column "__index_level_0__". However, DataFrame.from_arrow will also need to recognize that "__index_level_0__" should round-trip to an un-named index.
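
As a rough sketch of the requested behavior (not cudf's actual implementation), the two halves of the round trip could look like the following. All function names are hypothetical, and the regular expression is an assumption about how a generated index column name would be recognized.

```python
import re

# Assumed pattern for auto-generated index column names.
_GENERATED_INDEX_NAME = re.compile(r"^__index_level_\d+__$")


def index_column_name(index_name, level=0):
    """Pick the Arrow column name for an index level (export side)."""
    if index_name is None:
        return f"__index_level_{level}__"
    return str(index_name)


def restored_index_name(column_name):
    """Map an Arrow index column name back to the index name (import side)."""
    if _GENERATED_INDEX_NAME.match(column_name):
        return None  # generated name round-trips to an un-named index
    return column_name


def materialize_range(start, stop, step):
    """Turn range metadata into explicit per-row index values."""
    return list(range(start, stop, step))


column = index_column_name(None)
values = materialize_range(0, 10, 1)
print(column, values, restored_index_name(column))
```

One caveat of this naming-only approach is that an index genuinely named "__index_level_0__" would be restored as un-named; a real implementation would likely consult the stored metadata as well to disambiguate.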

Additional context
This bug complicates #13893, because distributed's "p2p" shuffle now assumes that to_pyarrow_table_dispatch will include the actual data for an index column when preserve_index=True (not just range metadata). We can certainly include a band-aid for this in dask_cudf, but the best long-term fix belongs in cudf.

@rjzamora added the "bug" and "Needs Triage" labels Sep 21, 2023
@GregoryKimball added the "0 - Backlog" and "Python" labels and removed the "Needs Triage" label Nov 9, 2023
@wence- self-assigned this Mar 22, 2024
wence- added a commit to wence-/cudf that referenced this issue Mar 22, 2024
When preserving the index and we have a RangeIndex, we must
materialize it, and write that information in the metadata correctly.

- Closes rapidsai#14159
@wence- wence- changed the title [BUG] DataFrame.to_arrow is inconsistent with pa.Table.from_pandas() when preserve_index=True [BUG] DataFrame.to_arrow is inconsistent with pa.Table.from_pandas() when preserve_index=True Mar 22, 2024
@rapids-bot rapids-bot bot closed this as completed in dda3f31 Mar 23, 2024
rapids-bot bot pushed a commit that referenced this issue Apr 22, 2024
Looks like these overrides should be safe to remove now that #14159 is closed out.

This should unblock the GPU CI failures we're seeing on Dask with 24.06 in dask/dask#11045.

Authors:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15514