Currently, a single batch can contain parquet files with different columns. For example, if there are 1000 events per file, but context_1 is not seen until the 1001st event, then the first parquet file is missing the context_1 column.
For Databricks loading, we account for this by setting the MERGESCHEMA format option. This works, but it is not ideal because it is slightly inefficient.
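To make the failure mode concrete, here is a minimal Spark sketch (Scala, local mode, hypothetical paths and columns; not loader code) of the analogous `mergeSchema` read option: without schema merging the `context_1` column can disappear from the batch, whereas merging unions the columns across files.

```scala
import org.apache.spark.sql.SparkSession

object MergeSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("merge-schema-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical batch: the first file has no `context_1`, the second does.
    val batchDir = "/tmp/batch"
    Seq(("e1", "2023-01-01")).toDF("event_id", "collector_tstamp")
      .write.mode("overwrite").parquet(s"$batchDir/part-0")
    Seq(("e2", "2023-01-01", "ctx")).toDF("event_id", "collector_tstamp", "context_1")
      .write.mode("overwrite").parquet(s"$batchDir/part-1")

    // Without schema merging, Spark infers the schema from a single file,
    // so `context_1` may be missing from the result entirely.
    spark.read.parquet(s"$batchDir/part-0", s"$batchDir/part-1").printSchema()

    // With mergeSchema (what MERGESCHEMA gives us on the Databricks side),
    // the footers of all files are read and the columns are unioned.
    spark.read.option("mergeSchema", "true")
      .parquet(s"$batchDir/part-0", s"$batchDir/part-1").printSchema()

    spark.stop()
  }
}
```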
When we add BigQuery support to RDB loader, this is going to be a bigger problem. For BigQuery, the load statement looks something like:
```sql
LOAD DATA INTO atomic.events
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://bucket/path/to/batch'],
  enable_list_inference = true
)
```
This load is successful, but when you query the table you find it is missing data for columns that were not present in every parquet file. This is because BigQuery infers the schema from the first parquet file only.
I have two suggested implementations; I don't know yet which is better.
Option 1: The transformer could emit a batch early if it sees a new schema for the first time. For example, if there are 1000 events per file, and the 1001st event contains context_1, then it emits the first 1000 events as a single batch without context_1, even if the 5 minute window has not completed yet. See #1198, which is relevant for this.
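A rough sketch of what the early-flush logic could look like (Scala, heavily simplified; `Event`, `BatchBuffer` and the flush callback are hypothetical stand-ins, not the transformer's real types):

```scala
// Hypothetical model of Option 1: flush the in-flight batch as soon as an
// event introduces a column we have not yet seen in this batch.
final case class Event(id: String, columns: Set[String])

final class BatchBuffer(flush: List[Event] => Unit) {
  private var pending: List[Event]      = Nil
  private var batchColumns: Set[String] = Set.empty

  def add(event: Event): Unit = {
    val newColumns = event.columns.diff(batchColumns)
    if (newColumns.nonEmpty && pending.nonEmpty) {
      // A new schema appeared (e.g. context_1 on the 1001st event): emit
      // everything buffered so far as its own batch, even though the
      // window has not closed yet.
      flush(pending.reverse)
      pending = Nil
      batchColumns = Set.empty
    }
    pending = event :: pending
    batchColumns = batchColumns ++ event.columns
  }

  // Normal end-of-window flush.
  def close(): Unit = {
    if (pending.nonEmpty) flush(pending.reverse)
    pending = Nil
    batchColumns = Set.empty
  }
}
```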
Option 2: Elsewhere, we have been experimenting with using a local Spark context to write the parquet files. It spills pending events to local disk and only starts creating the output file once the window is complete. If we change the transformer to use this approach, it also solves the parquet schema problem. However, it means the transformer needs access to disk, which could add expense and complexity for some deployments.
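For comparison, a minimal sketch of the local-Spark approach (Scala; the spill file, JSON-lines spill format and class names are hypothetical): events are only spilled locally during the window, and a single Spark write at window completion produces parquet files that all share the merged schema.

```scala
import java.nio.file.{Files, Path, StandardOpenOption}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of Option 2: spill raw events to local disk during the
// window, then write parquet once the window is complete, so the whole batch
// is produced with one consistent, merged schema.
final class WindowedParquetWriter(spark: SparkSession, spillDir: Path) {

  private val spillFile: Path = spillDir.resolve("pending-events.jsonl")

  def add(eventJson: String): Unit =
    // No parquet is written yet; the event is only appended to the spill file.
    Files.write(
      spillFile,
      (eventJson + "\n").getBytes("UTF-8"),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND
    )

  def completeWindow(outputPath: String): Unit = {
    // Reading the whole window at once lets Spark build a single schema
    // covering every column seen in the window, so every parquet file in
    // the batch has identical columns.
    val df = spark.read.json(spillFile.toString)
    df.write.mode("overwrite").parquet(outputPath)
    Files.deleteIfExists(spillFile)
  }
}
```

The trade-off the issue mentions is visible here: the writer needs a local `spillDir` with enough capacity for a full window, which is the extra cost and deployment complexity of this option.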