[PYTHON] Add nan_count to RowGroupMetaData #36068

Open
Fokko opened this issue Jun 14, 2023 · 3 comments

Comments

@Fokko
Contributor

Fokko commented Jun 14, 2023

Describe the enhancement requested

Iceberg relies on statistics (called metrics in Iceberg) to speed up queries. Most of the metrics are available and can easily be extracted using the MetadataCollector, except for the NaN counts. When someone uses an isNaN expression on a FLOAT/DOUBLE field, Iceberg tries to skip Parquet files by looking at the metrics stored in the manifest files. It would be great if nan_count could be added next to null_count:

Desktop python3 
Python 3.11.3 (main, Apr  7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
...                              "Brittle stars", "Centipede"]})
>>> metadata_collector = []
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(
...     table, '/tmp/table',
...     metadata_collector=metadata_collector)
>>> metadata_collector
[<pyarrow._parquet.FileMetaData object at 0x11f955850>
  created_by: parquet-cpp-arrow version 11.0.0
  num_columns: 2
  num_rows: 6
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 0]

>>> metadata_collector[0].row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x105837d80>
  num_columns: 2
  num_rows: 6
  total_byte_size: 256

>>> metadata_collector[0].row_group(0).to_dict()
{
	'num_columns': 2,
	'num_rows': 6,
	'total_byte_size': 256,
	'columns': [{
		'file_offset': 119,
		'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
		'physical_type': 'INT64',
		'num_values': 6,
		'path_in_schema': 'n_legs',
		'is_stats_set': True,
		'statistics': {
			'has_min_max': True,
			'min': 2,
			'max': 100,
			'null_count': 0,
			'distinct_count': 0,
			'num_values': 6,
			'physical_type': 'INT64'
		},
		'compression': 'SNAPPY',
		'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
		'has_dictionary_page': True,
		'dictionary_page_offset': 4,
		'data_page_offset': 46,
		'total_compressed_size': 115,
		'total_uncompressed_size': 117
	}, {
		'file_offset': 359,
		'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
		'physical_type': 'BYTE_ARRAY',
		'num_values': 6,
		'path_in_schema': 'animal',
		'is_stats_set': True,
		'statistics': {
			'has_min_max': True,
			'min': 'Brittle stars',
			'max': 'Parrot',
			'null_count': 0,
			'distinct_count': 0,
			'num_values': 6,
			'physical_type': 'BYTE_ARRAY'
		},
		'compression': 'SNAPPY',
		'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
		'has_dictionary_page': True,
		'dictionary_page_offset': 215,
		'data_page_offset': 302,
		'total_compressed_size': 144,
		'total_uncompressed_size': 139
	}]
}

In addition, Parquet itself is also looking into adding this: apache/parquet-format#196

Component(s)

Python

@Fokko Fokko changed the title Add NaN count to [PYTHON] Add nan_count to RowGroupMetaData Jun 14, 2023
@mapleFU
Member

mapleFU commented Jun 14, 2023

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L888
It seems we currently don't have a NaN count here?

@Fokko
Contributor Author

Fokko commented Jun 14, 2023

@mapleFU That's right, it is being introduced in apache/parquet-format#196. It is already in the Iceberg spec (and has been implemented many times), where the field is called nan_value_counts.
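To illustrate how such a count lets a reader skip files for an isNaN predicate, here is a simplified sketch. It is not Iceberg's actual evaluator: `DataFileMetrics`, `might_contain_nan`, and `files_to_scan` are hypothetical names, and only the `nan_value_counts` field name comes from the Iceberg spec.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFileMetrics:
    """Hypothetical, simplified stand-in for the per-file metrics in a manifest."""
    path: str
    # Maps a column's field ID to the number of NaN values in that file.
    nan_value_counts: Dict[int, int] = field(default_factory=dict)


def might_contain_nan(metrics: DataFileMetrics, field_id: int) -> bool:
    """Return False only when the metrics prove the column has no NaN values."""
    count = metrics.nan_value_counts.get(field_id)
    # A missing count proves nothing, so the file has to be kept.
    return count is None or count > 0


def files_to_scan(files: List[DataFileMetrics], field_id: int) -> List[str]:
    """Keep only the files that may hold rows matching isNaN(column)."""
    return [f.path for f in files if might_contain_nan(f, field_id)]


files = [
    DataFileMetrics("a.parquet", {1: 0}),  # proven NaN-free, can be skipped
    DataFileMetrics("b.parquet", {1: 3}),  # contains NaN values
    DataFileMetrics("c.parquet"),          # no count recorded
]
print(files_to_scan(files, field_id=1))  # ['b.parquet', 'c.parquet']
```

Without a NaN count from the writer, every file looks like `c.parquet` here and none can be skipped.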

@wgtmac
Member

wgtmac commented Jun 15, 2023

@Fokko Do you have any comments or suggestions on the proposal you mentioned above? It would be great if it could align with the requirements of Apache Iceberg.
