We were testing the schema validation with the Databricks connection, and we found a problem with partitioned tables.
SODA picks up `# Partition Information` and `# col_name` as if they were columns during the validation (see the first image).
We think this happens because of the output of the table's `DESCRIBE` statement (second image).
Is there anything we can change on our side, such as a setting? Or is this a bug on the SODA side that needs to be fixed?
The partition information is redundant for the check, because the partition column already appears in the regular column list and then again in the partition section.
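For reference, this is the kind of output a partitioned table produces in Databricks; a minimal PySpark sketch (the table and column names are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DESCRIBE appends a partition section after the regular column list,
# which is where the extra "# ..." rows and the repeated column come from:
spark.sql("DESCRIBE TABLE schema_name.table_name").show(truncate=False)
# +------------------------+---------+-------+
# |col_name                |data_type|comment|
# +------------------------+---------+-------+
# |id                      |bigint   |       |
# |event_date              |date     |       |
# |# Partition Information |         |       |
# |# col_name              |data_type|comment|
# |event_date              |date     |       |
# +------------------------+---------+-------+
```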
In PySpark one can do this:

```python
partition_columns = [
    col.name
    for col in spark.catalog.listColumns("schema_name.table_name")
    if col.isPartition
]
non_partition_columns = [
    col.name
    for col in spark.catalog.listColumns("schema_name.table_name")
    if not col.isPartition
]
```
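Alternatively, if the schema check keeps parsing the raw DESCRIBE output, a rough sketch of how the metadata rows could be filtered out, assuming only the regular column list is needed (the helper name and table name are hypothetical):

```python
from pyspark.sql import SparkSession

def real_columns(spark: SparkSession, table: str) -> list[str]:
    """Return the table's column names, dropping the '# Partition Information'
    and '# col_name' metadata rows plus the duplicated partition columns that
    DESCRIBE appends at the end."""
    rows = spark.sql(f"DESCRIBE TABLE {table}").collect()
    columns = []
    for row in rows:
        name = row["col_name"].strip()
        # The partition section starts at the first '#' marker row;
        # everything after it is either metadata or a repeated column.
        if name.startswith("#"):
            break
        if name:  # skip blank separator rows
            columns.append(name)
    return columns
```

Everything before the first `#` marker row is the real column list, so the check would only compare against those names.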