
[Issue #4] Make writing of DataFrames more flexible #5

Draft
wants to merge 2 commits into main

Conversation

GKrivosheev-rms

  • Added an overload extension with the ability to write an IEnumerable<DataFrame> to a single file
  • Added the ability to write a DataFrame re-entrantly to the same file
  • Made the nullability of columns depend on the underlying type (stable despite changes to row data)

/// <param name="path">Path to write to</param>
/// <param name="writerProperties">Optional writer properties that override the default properties</param>
/// <param name="logicalTypeOverrides">Mapping from column names to Parquet logical types,
/// overriding the default logical types. When writing decimal columns, a logical type must be provided
/// to specify the precision and scale to use.</param>
/// <param name="rowGroupSize">Maximum number of rows per row group</param>
 public static void ToParquet(
-    this DataFrame dataFrame, string path, WriterProperties? writerProperties = null,
+    this IEnumerable<DataFrame> dataFrames, string path, WriterProperties? writerProperties = null,
GKrivosheev-rms (Author):

An easy overload to write multiple DataFrames. They must all have the same schema; this is currently not enforced.
TODO: ensure that the schema is the same for all DataFrames in the sequence.
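The TODO could be addressed with a check along these lines. This is a sketch only: EnsureSameSchema is a hypothetical helper that does not exist in the PR, and it assumes the Microsoft.Data.Analysis DataFrame.Columns API (Name and DataType properties on each column).

```csharp
// Hypothetical helper: verify that a DataFrame has the same column names and
// types as the first DataFrame in the sequence, so all row groups written to
// the single Parquet file share one schema.
private static void EnsureSameSchema(DataFrame first, DataFrame next)
{
    if (first.Columns.Count != next.Columns.Count)
    {
        throw new ArgumentException("All DataFrames must have the same number of columns");
    }
    for (var i = 0; i < first.Columns.Count; ++i)
    {
        var expected = first.Columns[i];
        var actual = next.Columns[i];
        if (expected.Name != actual.Name || expected.DataType != actual.DataType)
        {
            throw new ArgumentException(
                $"Column {i} mismatch: expected '{expected.Name}' ({expected.DataType}), " +
                $"got '{actual.Name}' ({actual.DataType})");
        }
    }
}
```

Such a check could be run against each DataFrame as it is enumerated, avoiding a second pass over the sequence.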


-if (nullable && dataType.IsValueType)
-{
+if (dataType.IsValueType && Nullable.GetUnderlyingType(dataType) != null)
@GKrivosheev-rms (Author), Feb 5, 2022:
I think it's a bug to change the schema of a Parquet file depending on the data within a particular row. If you are writing int? it should always be a nullable column, even if there are no actual null values in that specific DataFrame instance. Some readers may expect the exact type.

@adamreeve (Contributor), Feb 7, 2022:

Yes, I think it makes sense to always write value types as nullable, as DataFrame columns are always nullable. This is going to be required to support writing an IEnumerable<DataFrame>: otherwise there could be a scenario where the first file has no nulls but a subsequent file has nulls, which would cause a crash, and we can't iterate over the IEnumerable multiple times to check for nulls unless we were more restrictive and only allowed writing an IReadOnlyList, for example. Making the type nullable depending on the null count was something I wasn't sure about when I wrote this, and this scenario is a good example of why we should just always write value types as nullable.

This change looks wrong though. When you have a DataFrameColumn of Int32 for example, column.DataType will be System.Int32, not System.Nullable<System.Int32>, even if the column contains null values. So you need to remove the Nullable.GetUnderlyingType(dataType) != null check as that will always be false. I think you'll find that this is causing some test failures, and you'll also need to update tests that expect non-nullable columns.

Can you also please add the surrounding curly braces back to match our code style.
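Following this suggestion, the condition would drop both the null-count check and the Nullable.GetUnderlyingType check and simply treat every value-type column as nullable. A minimal sketch of what that might look like; what the surrounding code does with dataType afterwards is an assumption, and the names are illustrative only:

```csharp
// Sketch of the suggested fix: value-type columns are always written as
// nullable, independent of the data in any particular DataFrame, so the
// Parquet schema stays stable across files. Braces kept to match the
// project's code style.
if (dataType.IsValueType)
{
    dataType = typeof(Nullable<>).MakeGenericType(dataType);
}
```

This keeps the schema a function of the column's underlying type alone, which is what both comments above are asking for.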

@@ -8,58 +8,100 @@ namespace ParquetSharp
public static class DataFrameExtensions
{
 /// <summary>
-/// Write a DataFrame in Parquet format
+/// Writes a DataFrame in Parquet format using an existing writer.
GKrivosheev-rms (Author):

TODO: add unit tests.

Contributor:

👍 Could you also please document this feature in the README file?



var numRows = dataFrame.Rows.Count;
var offset = 0L;
DataFrame firstDataFrame = dataFrames.First();
Contributor:

Using First() then a foreach over dataFrames below requires multiple enumerations over the IEnumerable, which won't be compatible with all IEnumerable types. I think you'll probably want to create a null ParquetFileWriter and then assign it on the first iteration of the loop.
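One way to restructure this so the sequence is enumerated exactly once is to declare the writer as nullable and assign it on the first loop iteration, as the comment suggests. A sketch under stated assumptions: CreateFileWriter and WriteDataFrame are hypothetical stand-ins for whatever code builds the ParquetFileWriter from the first DataFrame's schema and writes a DataFrame's row groups.

```csharp
// Enumerate the sequence only once: create the ParquetFileWriter lazily
// from the first DataFrame seen, then reuse it for every subsequent one.
ParquetFileWriter? fileWriter = null;
try
{
    foreach (var dataFrame in dataFrames)
    {
        fileWriter ??= CreateFileWriter(dataFrame, path, writerProperties, logicalTypeOverrides);
        WriteDataFrame(fileWriter, dataFrame, rowGroupSize);
    }
}
finally
{
    // Dispose even when the sequence was empty (fileWriter stays null) or
    // when writing throws part-way through.
    fileWriter?.Dispose();
}
```

This also works for one-shot IEnumerable sources such as iterator methods, which First() followed by foreach would consume incorrectly.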

@adamreeve (Contributor):

Thanks for the PR. The approach looks good to me, but I've left a couple of comments with some things that need addressing.

Just a note about code formatting, our format check that runs in CI doesn't currently show the changes required, but you can manually run the formatter before committing to make sure the format will be compatible with:

dotnet tool restore
dotnet jb cleanupcode --profile="Built-in: Reformat Code" --settings="ParquetSharp.DataFrame.DotSettings" --verbosity=WARN "ParquetSharp.DataFrame" "ParquetSharp.DataFrame.Test"

@adamreeve (Contributor):

Hi @GKrivosheev-rms, are you planning on following through with this PR? I think this feature would be quite useful so I'm happy to take over completing this change if you no longer have the time for this.
