-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Issue #4] Make writng of DataFrames more flexible #5
base: main
Are you sure you want to change the base?
Conversation
GKrivosheev-rms
commented
Feb 5, 2022
- Added overload extension with ability to write an IEnumerable to a single file
- Add ability to write DataFrame re-entrrantly to the same file
- Make Nullability of columns depending on underlying type (stable despite changes to row data)
- Added overload extension with ability to write an IEnumerable<DataFrame> to a single file - Add ability to write DataFrame re-entrrantly to the same file - Make Nullability of columns depending on underlying type (stable despite changes to row data)
/// <param name="path">Path to write to</param> | ||
/// <param name="writerProperties">Optional writer properties that override the default properties</param> | ||
/// <param name="logicalTypeOverrides">Mapping from column names to Parquet logical types, | ||
/// overriding the default logical types. When writing decimal columns, a logical type must be provided | ||
/// to specify the precision and scale to use.</param> | ||
/// <param name="rowGroupSize">Maximum number of rows per row group</param> | ||
public static void ToParquet( | ||
this DataFrame dataFrame, string path, WriterProperties? writerProperties = null, | ||
this IEnumerable<DataFrame> dataFrames, string path, WriterProperties? writerProperties = null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
easy overload to write multiple dataframes. They must be of the same schema. this is currenlty not enforced.
TODO: ensure that schema is same in all dataframes in a sequence.
|
||
if (nullable && dataType.IsValueType) | ||
{ | ||
if (dataType.IsValueType && Nullable.GetUnderlyingType(dataType) != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bug to change a schema of parquet depending on data within a particular row. If you are writingint?
it should always be a nullable column even if there are no actual null
values in that specific dataframe instance.. Some readers may expect exact type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think it makes sense to always write value types as nullable, as DataFrame columns are always nullable. This is going to be required to support writing an IEnumerable<DataFrame>
as otherwise there could be a scenario where the first file has no nulls but a subsequent file has nulls which would cause a crash, and we can't iterate over the IEnumerable
multiple times to check for nulls unless we were more restrictive and only allowed writing IReadOnlyList
for example. But making the type nullable depending on the null count was something I wasn't sure about when I wrote this and this scenario is a good example of why we should just always write value types as nullable.
This change looks wrong though. When you have a DataFrameColumn
of Int32
for example, column.DataType
will be System.Int32
, not System.Nullable<System.Int32>
, even if the column contains null values. So you need to remove the Nullable.GetUnderlyingType(dataType) != null
check as that will always be false. I think you'll find that this is causing some test failures, and you'll also need to update tests that expect non-nullable columns.
Can you also please add the surrounding curly braces back to match our code style.
@@ -8,58 +8,100 @@ namespace ParquetSharp | |||
public static class DataFrameExtensions | |||
{ | |||
/// <summary> | |||
/// Write a DataFrame in Parquet format | |||
/// Writes a Dataframe in Parquet format using existing writer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: add unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 could you also please document this feature in the README file
|
||
if (nullable && dataType.IsValueType) | ||
{ | ||
if (dataType.IsValueType && Nullable.GetUnderlyingType(dataType) != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think it makes sense to always write value types as nullable, as DataFrame columns are always nullable. This is going to be required to support writing an IEnumerable<DataFrame>
as otherwise there could be a scenario where the first file has no nulls but a subsequent file has nulls which would cause a crash, and we can't iterate over the IEnumerable
multiple times to check for nulls unless we were more restrictive and only allowed writing IReadOnlyList
for example. But making the type nullable depending on the null count was something I wasn't sure about when I wrote this and this scenario is a good example of why we should just always write value types as nullable.
This change looks wrong though. When you have a DataFrameColumn
of Int32
for example, column.DataType
will be System.Int32
, not System.Nullable<System.Int32>
, even if the column contains null values. So you need to remove the Nullable.GetUnderlyingType(dataType) != null
check as that will always be false. I think you'll find that this is causing some test failures, and you'll also need to update tests that expect non-nullable columns.
Can you also please add the surrounding curly braces back to match our code style.
|
||
var numRows = dataFrame.Rows.Count; | ||
var offset = 0L; | ||
DataFrame firstDataFrame = dataFrames.First(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using First()
then a foreach
over dataFrames below requires multiple enumerations over the IEnumerable
, which won't be compatible with all IEnumerable
types. I think you'll probably want to create a null ParquetFileWriter
and then assign it on the first iteration of the loop.
Thanks for the PR, the approach looks good to me but I've left a couple of comments with some things that need addressing. Just a note about code formatting, our format check that runs in CI doesn't currently show the changes required, but you can manually run the formatter before committing to make sure the format will be compatible with:
|
Hi @GKrivosheev-rms, are you planning on following through with this PR? I think this feature would be quite useful so I'm happy to take over completing this change if you no longer have the time for this. |