Different behavior in datafusion 35.0.0 in reading hive-partitioned parquet data #579

@jwimberl

Describe the bug
pip recently began installing datafusion 35.0.0. Unlike a previous installation of 34.0.0, creating an external table from hive-partitioned parquet data following the [documented instructions](https://arrow.apache.org/datafusion/user-guide/sql/ddl.html) no longer works. The partition columns show up as columns of the table, but the columns stored in the parquet files themselves do not appear.

To Reproduce

# prepare fake data
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
data = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
table = pa.Table.from_pandas(data)
import os
os.mkdir("fake=0")
pq.write_table(table,"./fake=0/data.parquet")

# load into datafusion
import datafusion as df
ctx = df.SessionContext()
ctx.sql("""
CREATE EXTERNAL TABLE data
STORED AS PARQUET
PARTITIONED BY (fake)
LOCATION './*/data.parquet'
""")

The loaded data is missing col1 and col2:

>>> ctx.sql("SELECT * FROM data")
DataFrame()
+------+
| fake |
+------+
| 0    |
| 0    |
+------+
>>> ctx.sql("SELECT table_name, column_name FROM information_schema.columns")
DataFrame()
+------------+-------------+
| table_name | column_name |
+------------+-------------+
| data       | fake        |
+------------+-------------+

Expected behavior
The same steps with DataFusion 34.0.0 produce the following output:

>>> ctx.sql("SELECT * FROM data");
DataFrame()
+------+------+------+
| col1 | col2 | fake |
+------+------+------+
| 1    | 3    | 0    |
| 2    | 4    | 0    |
+------+------+------+
>>> ctx.sql("SELECT table_name, column_name FROM information_schema.columns")
DataFrame()
+------------+-------------+
| table_name | column_name |
+------------+-------------+
| data       | col1        |
| data       | col2        |
| data       | fake        |
+------------+-------------+

Additional context
Operating system: Rocky 8
Python version: 3.10.11
DataFusion version: 35.0.0, recently installed via pip
pyarrow version: 15.0.0
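
For readers unfamiliar with the layout used in the reproduction, here is a minimal sketch of how a hive-style path such as `./fake=0/data.parquet` encodes the partition column `fake`. The helper `hive_partitions` is hypothetical and written only for illustration; it is not part of DataFusion or pyarrow and says nothing about how either library parses paths internally.

```python
from pathlib import PurePosixPath


def hive_partitions(path: str) -> dict[str, str]:
    """Return {column: value} for each `key=value` directory in `path`.

    Hypothetical illustration only, not a DataFusion or pyarrow API.
    """
    partitions = {}
    # Only directory components are inspected; the last component is the file name.
    for component in PurePosixPath(path).parts[:-1]:
        key, sep, value = component.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


print(hive_partitions("./fake=0/data.parquet"))
```

Values extracted this way are plain strings. The columns `col1` and `col2` never appear in the path because they live inside the parquet file, which is the part of the schema that goes missing in 35.0.0.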

Labels: bug
