[BUG] `DataFrame.to_arrow` is inconsistent with `pa.Table.from_pandas()` when `preserve_index=True` #14159

rjzamora · 2023-09-21T18:56:36Z

Describe the bug
It is my understanding that we want the DataFrame.to_arrow API to be consistent with pa.Table.from_pandas() (when possible). This is currently not the case when DataFrame.index is a RangeIndex, and preserve_index=True is specified. In this case, pa.Table.from_pandas() will use the RangeIndex information to produce an explicit "__index_level_0__" column in the output pyarrow Table.

Side Note: Creating an explicit column is the only way to generate a Table schema that will "safely" preserve the index in dask, because that schema may be used to "rebuild" partitions with a different number of rows later on (and so the original start/stop metadata can be "wrong").

Steps/Code to reproduce bug

In [1]: import pandas as pd, cudf
In [2]: import pyarrow as pa
In [3]: df = cudf.DataFrame({"x": ["cat", "dog"] * 5})
In [4]: pdf = df.to_pandas()

The schema extracted from pandas will have an explicit "__index_level_0__" column when preserve_index=True, and will store RangeIndex metadata only when preserve_index=None:

In [6]: pa.Table.from_pandas(pdf, preserve_index=None).schema
Out[6]: 
x: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 362
In [7]: pa.Table.from_pandas(pdf, preserve_index=True).schema
Out[7]: 
x: string
__index_level_0__: int64
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 431

The schema extracted from cudf will never have an explicit index column, and will only store RangeIndex metadata if preserve_index=True.

In [8]: df.to_arrow(preserve_index=None).schema
Out[8]: 
x: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 177
In [9]: df.to_arrow(preserve_index=True).schema
Out[9]: 
x: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 362

Expected behavior
I'd like for DataFrame.to_arrow(preserve_index=True) to be consistent with pa.Table.from_pandas(..., preserve_index=True). More specifically, I'd like cudf to produce an explicit column in the pyarrow Table. When the original RangeIndex is un-named, it probably makes sense to call the column "__index_level_0__". However, DataFrame.from_arrow will also need to recognize that "__index_level_0__" should round-trip to an un-named index.

Additional context
This bug complicates #13893, because distributed's "p2p" shuffle now assumes that to_pyarrow_table_dispatch will include the actual data for an index column when preserve_index=True (not just range metadata). We can certainly include a band-aid for this in dask_cudf, but the best long-term fix belongs in cudf.


When preserving the index of a frame that has a RangeIndex, we must materialize the index as a column and record that column correctly in the metadata.

- Closes rapidsai#14159

Looks like these overrides should be safe to remove now that #14159 is closed out. This should unblock the GPU CI failures we're seeing on Dask with 24.06 in dask/dask#11045.

Authors:
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- Richard (Rick) Zamora (https://github.com/rjzamora)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Richard (Rick) Zamora (https://github.com/rjzamora)
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15514

rjzamora added bug Something isn't working Needs Triage Need team to review and classify labels Sep 21, 2023

rjzamora mentioned this issue Sep 21, 2023

Allow explicit shuffle="p2p" within dask-cudf API #13893

Merged

GregoryKimball added 0 - Backlog In queue waiting for assignment Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Nov 9, 2023

wence- self-assigned this Mar 22, 2024

wence- added a commit to wence-/cudf that referenced this issue Mar 22, 2024

Add test of rapidsai#14159

3cc608b

wence- mentioned this issue Mar 22, 2024

Fix arrow-based round trip of empty dataframes #15373

Merged

3 tasks

wence- changed the title ~~[BUG] DataFrame.to_arrow is inconsistent with pa.Table.from_pandas() when preserve_index=True~~ [BUG] DataFrame.to_arrow is inconsistent with pa.Table.from_pandas() when preserve_index=True Mar 22, 2024

wence- added a commit to wence-/cudf that referenced this issue Mar 22, 2024

Add test of rapidsai#14159

75fcdf9

wence- added a commit to wence-/cudf that referenced this issue Mar 22, 2024

Add test of rapidsai#14159

999bb19

wence- added a commit to wence-/cudf that referenced this issue Mar 22, 2024

Add tests of rapidsai#12243 and rapidsai#14159

9bc0344

rapids-bot bot closed this as completed in #15373 Mar 23, 2024

rapids-bot bot closed this as completed in dda3f31 Mar 23, 2024

charlesbluca mentioned this issue Apr 11, 2024

Remove index name overrides in dask-cudf pyarrow table dispatch #15514

Merged

3 tasks
