[Parquet][C++] Support month_day_nano_interval type in Parquet #36799

Open
FoxHeather opened this issue Jul 21, 2023 · 11 comments

FoxHeather commented Jul 21, 2023

Describe the usage question you have. Please include as many useful details as possible.

I want to generate a Parquet file containing a column of type month_day_nano_interval.
This is my Python code:

import pyarrow as pa
import pyarrow.parquet as pq

# Define Schema
schema = pa.schema([
    ('itv', pa.month_day_nano_interval())
])

# Prepare data
itv = pa.array([(1, 15, -30),
                (0, 0, 0),
                (13, 25, 1000),
                (13, 25, 1000000),
                (13, 25, 1000000000)
               ],
               type=pa.month_day_nano_interval())

# Generate Parquet data
batch = pa.RecordBatch.from_arrays([itv], schema=schema)
table = pa.Table.from_batches([batch])

# Write Parquet file pqtpitvl.parquet
pq.write_table(table, 'pqtpitvl.parquet')

It failed with the following error:
pyarrow.lib.ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: month_day_nano_interval

@FoxHeather FoxHeather added the Type: usage Issue is a user question label Jul 21, 2023
@westonpace westonpace changed the title How to generate a parquet file with type month_day_nano_interval [Python] How to generate a parquet file with type month_day_nano_interval Jul 25, 2023
@westonpace
Member

I don't know if parquet supports month/day/nano. It looks like it supports month/day/milli. CC @emkornfield who might know more details.

@FoxHeather
Author

@mapleFU
Member

mapleFU commented Jul 28, 2023

You can:

  1. Cast the interval to time64, duration, or another supported type, or use an extension type.
  2. Maybe we can map it to https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval , but that needs further development.

Currently, we don't support writing an Arrow month_day_nano_interval to a Parquet file. See #36798 (comment)

@FoxHeather
Author

This error was reported by me.
#36798 (comment)

@emkornfield
Contributor

Yeah, Parquet does not have an analogous type. Besides millis vs. nanoseconds, the type in Parquet stores each integer as unsigned. I think we have three options to try to fix this:

  1. Ask parquet to introduce a new type.
  2. Allow for writing to existing logical type with validation that there is no data loss.
  3. Specify a mapping of our own and use the stored Arrow schema to resurrect the type properly. For this last option we should collaborate with other implementations. @alamb has Rust established a convention for mapping MONTH_DAY_NANOSECONDS to Parquet?
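
To make the second option concrete, here is a minimal sketch of the kind of validation it would need. It assumes the legacy Parquet INTERVAL layout described in LogicalTypes.md: a 12-byte fixed_len_byte_array holding three little-endian unsigned 32-bit integers (months, days, milliseconds). The function names are hypothetical and this is not how Arrow implements anything today.

```python
import struct

UINT32_MAX = 2**32 - 1
NANOS_PER_MILLI = 1_000_000


def pack_parquet_interval(months, days, nanos):
    """Pack (months, days, nanos) into 12 bytes, refusing any lossy conversion."""
    if nanos % NANOS_PER_MILLI != 0:
        raise ValueError(f"nanos={nanos} has sub-millisecond precision that would be lost")
    millis = nanos // NANOS_PER_MILLI
    for label, value in (("months", months), ("days", days), ("millis", millis)):
        if not 0 <= value <= UINT32_MAX:
            raise ValueError(f"{label}={value} does not fit an unsigned 32-bit integer")
    return struct.pack("<III", months, days, millis)


def unpack_parquet_interval(raw):
    """Inverse of pack_parquet_interval: 12 bytes back to (months, days, nanos)."""
    months, days, millis = struct.unpack("<III", raw)
    return months, days, millis * NANOS_PER_MILLI
```

With the values from the original report, only `(0, 0, 0)`, `(13, 25, 1000000)` and `(13, 25, 1000000000)` would pass this check; `(1, 15, -30)` and `(13, 25, 1000)` carry sub-millisecond nanoseconds and would be rejected rather than silently truncated.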

@alamb
Contributor

alamb commented Jul 30, 2023

@emkornfield I do not think Rust has established a convention yet (I don't think the rust parquet writer supports writing monthdaynano intervals yet):

https://github.com/apache/arrow-rs/blob/a31005605ead4b70bd89fa29bd09d7b1613636dc/parquet/src/arrow/arrow_writer/mod.rs#L1828-L1833

Maybe @tustvold has an opinion on what the mapping should be

@FoxHeather
Author

This is the documentation I found:
Create instance of an interval type representing months, days and nanoseconds between two dates.
https://arrow.apache.org/docs/python/generated/pyarrow.month_day_nano_interval.html#pyarrow.month_day_nano_interval

pyarrow has type month_day_nano_interval

@mapleFU
Member

mapleFU commented Aug 1, 2023

By the way, do we have a document describing the mapping between Arrow types and the corresponding Parquet types? That would make things clearer.

@alamb
Contributor

alamb commented Aug 1, 2023

By the way, do we have a document describing the mapping between Arrow types and the corresponding Parquet types? That would make things clearer.

I don't think we have it documented anywhere other than the rust code itself, for example:

https://github.com/apache/arrow-rs/blob/a31005605ead4b70bd89fa29bd09d7b1613636dc/parquet/src/arrow/schema/primitive.rs

https://github.com/apache/arrow-rs/blob/a31005605ead4b70bd89fa29bd09d7b1613636dc/parquet/src/arrow/schema/complex.rs

@tustvold
Contributor

tustvold commented Aug 1, 2023

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md is the canonical mapping. One thing worth highlighting here is that the Parquet schema is authoritative; the embedded Arrow schema is just a hint that provides additional information, see apache/arrow-rs#1663.

As such, we would need a way to represent nanosecond intervals natively in Parquet before we could add support to Arrow. The upstream ticket for this is apache/parquet-format#43

@mapleFU
Member

mapleFU commented Aug 2, 2023

Oh by the way, we can refer to this doc for parquet arrow mapping: https://github.com/apache/arrow/blob/main/docs/source/cpp/parquet.rst#logical-types

@jorisvandenbossche jorisvandenbossche changed the title [Python] How to generate a parquet file with type month_day_nano_interval [Parquet][C++] Support month_day_nano_interval type in Parquet Mar 19, 2024