
[Issue #4] Make writing of DataFrames more flexible #5

Draft
wants to merge 2 commits into main

Conversation

GKrivosheev-rms

  • Added an overload extension with the ability to write an IEnumerable<DataFrame> to a single file
  • Added the ability to write a DataFrame re-entrantly to the same file
  • Made the nullability of columns depend on the underlying type (stable despite changes to row data)

/// <param name="path">Path to write to</param>
/// <param name="writerProperties">Optional writer properties that override the default properties</param>
/// <param name="logicalTypeOverrides">Mapping from column names to Parquet logical types,
/// overriding the default logical types. When writing decimal columns, a logical type must be provided
/// to specify the precision and scale to use.</param>
/// <param name="rowGroupSize">Maximum number of rows per row group</param>
 public static void ToParquet(
-    this DataFrame dataFrame, string path, WriterProperties? writerProperties = null,
+    this IEnumerable<DataFrame> dataFrames, string path, WriterProperties? writerProperties = null,
GKrivosheev-rms (Author):

An easy overload to write multiple DataFrames. They must all have the same schema; this is currently not enforced.
TODO: ensure that the schema is the same for all DataFrames in the sequence.
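The TODO could be addressed with a check along these lines. This is a sketch only: EnsureSameSchema is a hypothetical helper that does not exist in the PR, and it assumes the Microsoft.Data.Analysis DataFrame.Columns API (Name and DataType properties on each column).

```csharp
// Hypothetical helper: verify that a DataFrame has the same column names and
// types as the first DataFrame in the sequence, so all row groups written to
// the single Parquet file share one schema.
private static void EnsureSameSchema(DataFrame first, DataFrame next)
{
    if (first.Columns.Count != next.Columns.Count)
    {
        throw new ArgumentException("All DataFrames must have the same number of columns");
    }
    for (var i = 0; i < first.Columns.Count; ++i)
    {
        var expected = first.Columns[i];
        var actual = next.Columns[i];
        if (expected.Name != actual.Name || expected.DataType != actual.DataType)
        {
            throw new ArgumentException(
                $"Column {i} mismatch: expected '{expected.Name}' ({expected.DataType}), " +
                $"got '{actual.Name}' ({actual.DataType})");
        }
    }
}
```

Such a check could be run against each DataFrame as it is enumerated, avoiding a second pass over the sequence.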


-if (nullable && dataType.IsValueType)
-{
+if (dataType.IsValueType && Nullable.GetUnderlyingType(dataType) != null)
@GKrivosheev-rms (Author), Feb 5, 2022:
I think it's a bug to change the schema of a Parquet file depending on the data within a particular row. If you are writing int? it should always be a nullable column, even if there are no actual null values in that specific DataFrame instance. Some readers may expect the exact type.

@adamreeve (Contributor), Feb 7, 2022:

Yes, I think it makes sense to always write value types as nullable, as DataFrame columns are always nullable. This is going to be required to support writing an IEnumerable<DataFrame>: otherwise there could be a scenario where the first file has no nulls but a subsequent file has nulls, which would cause a crash, and we can't iterate over the IEnumerable multiple times to check for nulls unless we were more restrictive and only allowed writing an IReadOnlyList, for example. Making the type nullable depending on the null count was something I wasn't sure about when I wrote this, and this scenario is a good example of why we should just always write value types as nullable.

This change looks wrong though. When you have a DataFrameColumn of Int32 for example, column.DataType will be System.Int32, not System.Nullable<System.Int32>, even if the column contains null values. So you need to remove the Nullable.GetUnderlyingType(dataType) != null check as that will always be false. I think you'll find that this is causing some test failures, and you'll also need to update tests that expect non-nullable columns.

Can you also please add the surrounding curly braces back to match our code style.
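Following this suggestion, the condition would drop both the null-count check and the Nullable.GetUnderlyingType check and simply treat every value-type column as nullable. A minimal sketch of what that might look like; what the surrounding code does with dataType afterwards is an assumption, and the names are illustrative only:

```csharp
// Sketch of the suggested fix: value-type columns are always written as
// nullable, independent of the data in any particular DataFrame, so the
// Parquet schema stays stable across files. Braces kept to match the
// project's code style.
if (dataType.IsValueType)
{
    dataType = typeof(Nullable<>).MakeGenericType(dataType);
}
```

This keeps the schema a function of the column's underlying type alone, which is what both comments above are asking for.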

@@ -8,58 +8,100 @@ namespace ParquetSharp
public static class DataFrameExtensions
{
 /// <summary>
-/// Write a DataFrame in Parquet format
+/// Writes a DataFrame in Parquet format using an existing writer.
GKrivosheev-rms (Author):

TODO: add unit tests.

Contributor:

👍 Could you also please document this feature in the README file?



var numRows = dataFrame.Rows.Count;
var offset = 0L;
DataFrame firstDataFrame = dataFrames.First();
Contributor:

Using First() then a foreach over dataFrames below requires multiple enumerations over the IEnumerable, which won't be compatible with all IEnumerable types. I think you'll probably want to create a null ParquetFileWriter and then assign it on the first iteration of the loop.
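One way to restructure this so the sequence is enumerated exactly once is to declare the writer as nullable and assign it on the first loop iteration, as the comment suggests. A sketch under stated assumptions: CreateFileWriter and WriteDataFrame are hypothetical stand-ins for whatever code builds the ParquetFileWriter from the first DataFrame's schema and writes a DataFrame's row groups.

```csharp
// Enumerate the sequence only once: create the ParquetFileWriter lazily
// from the first DataFrame seen, then reuse it for every subsequent one.
ParquetFileWriter? fileWriter = null;
try
{
    foreach (var dataFrame in dataFrames)
    {
        fileWriter ??= CreateFileWriter(dataFrame, path, writerProperties, logicalTypeOverrides);
        WriteDataFrame(fileWriter, dataFrame, rowGroupSize);
    }
}
finally
{
    // Dispose even when the sequence was empty (fileWriter stays null) or
    // when writing throws part-way through.
    fileWriter?.Dispose();
}
```

This also works for one-shot IEnumerable sources such as iterator methods, which First() followed by foreach would consume incorrectly.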

@adamreeve (Contributor):

Thanks for the PR. The approach looks good to me, but I've left a couple of comments with some things that need addressing.

Just a note about code formatting, our format check that runs in CI doesn't currently show the changes required, but you can manually run the formatter before committing to make sure the format will be compatible with:

dotnet tool restore
dotnet jb cleanupcode --profile="Built-in: Reformat Code" --settings="ParquetSharp.DataFrame.DotSettings" --verbosity=WARN "ParquetSharp.DataFrame" "ParquetSharp.DataFrame.Test"

@adamreeve (Contributor):

Hi @GKrivosheev-rms, are you planning on following through with this PR? I think this feature would be quite useful so I'm happy to take over completing this change if you no longer have the time for this.
