Problem summary
When writing Spark tables to Lance with format version `V2_1` or `V2_2`, struct columns (and therefore their nested fields) are written as NULL. The same write succeeds with `V2_0`. Root cause: the Spark-side Arrow writer (`LanceArrowWriter`) does not set the parent `StructVector` validity bit when writing a non-null struct. Lance v2.1+ uses structural encoders that honor parent validity; the missing parent-valid bits cause the structural encoders to skip child encoding, producing NULLs.
Minimal steps to reproduce
- Start a Spark session with the Lance Spark connector on the classpath (use the connector built from this repo).
- Run (SQL or DataFrame equivalent):

```sql
-- Create a temporary view with struct data
CREATE OR REPLACE TEMP VIEW test_data AS
SELECT
  1 AS id,
  struct('value1' AS field1, 'value2' AS field2) AS value
UNION ALL
SELECT
  2 AS id,
  struct('value3' AS field1, 'value4' AS field2) AS value;

-- Show the data
SELECT * FROM test_data;

-- Write to Parquet
CREATE TABLE parquet_table
USING parquet
LOCATION '/tmp/test_parquet'
AS SELECT * FROM test_data;

-- Read from Parquet
CREATE OR REPLACE TEMP VIEW parquet_data AS
SELECT * FROM parquet_table;

-- Configure the Lance Spark catalogs
SET spark.sql.catalog.lance=com.lancedb.lance.spark.LanceNamespaceSparkCatalog;
SET spark.sql.catalog.lance.impl=dir;
SET spark.sql.catalog.lance.root=/tmp/lance/base;
SET spark.sql.catalog.lance.storage.data_storage_version=V2_0;

SET spark.sql.catalog.lance2=com.lancedb.lance.spark.LanceNamespaceSparkCatalog;
SET spark.sql.catalog.lance2.impl=dir;
SET spark.sql.catalog.lance2.root=/tmp/lance/test;
SET spark.sql.catalog.lance2.storage.data_storage_version=V2_2;

-- Drop tables if they exist
DROP TABLE IF EXISTS lance.default.lance_v20_test;
DROP TABLE IF EXISTS lance2.default.lance_v22_test;

-- Create the V2_0 Lance table
CREATE TABLE lance.default.lance_v20_test (
  id INT,
  value STRUCT<
    field1: STRING,
    field2: STRING
  >
) TBLPROPERTIES ('data_storage_version'='V2_0');

-- Insert data into the Lance table
INSERT INTO lance.default.lance_v20_test
SELECT id, value FROM parquet_data;

-- Query the Lance table to verify
SELECT * FROM lance.default.lance_v20_test;

-- Create the V2_2 Lance table
CREATE TABLE lance2.default.lance_v22_test (
  id INT,
  value STRUCT<
    field1: STRING,
    field2: STRING
  >
) TBLPROPERTIES ('data_storage_version'='V2_2');

-- Insert data into the Lance table
INSERT INTO lance2.default.lance_v22_test
SELECT id, value FROM parquet_data;

-- Query the Lance table to verify
SELECT * FROM lance2.default.lance_v22_test;
```

Observed:
- `V2_0` writes show `value` with its nested fields populated (non-NULL).
- `V2_1`/`V2_2` writes show `value` as NULL.
Root cause
- Structural encoders for v2.1+ check the parent's validity bitmap to decide whether to encode child fields. If the parent slot is considered null (its validity bit is missing or false), the child fields are not encoded, and readers see NULL.
- The v2.0 legacy encoders encode nested children differently and tolerate missing parent-validity bits, which is why v2.0 appears to work.
Details
- Spark writes produce an Arrow `VectorSchemaRoot` and export it via the Arrow C data interface to native code. The chain is: `LanceDataWriter` → `LanceArrowWriter` (JVM/Scala) → `ArrowArrayStream` export → JNI (`fragment.rs` `ArrowArrayStreamReader`).
- Rust side: `FileFragment::create_fragments` → write pipeline → `v2::writer::FileWriter` (format-specific encoding).
- Lance file format encoding selection logic, `rust/lance-encoding/src/encoder.rs::default_encoding_strategy(version)`:
  - `V2_0` → legacy `CoreFieldEncodingStrategy` (more tolerant)
  - `V2_1`+ → `StructuralEncodingStrategy` (structural encoders that rely on Arrow parent validity)
- `writer.rs` uses per-column encoders chosen by the above strategy.
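The version-to-strategy dispatch described above can be sketched as a toy function; the strategy names are simplified stand-ins for the Rust types in `encoder.rs`, not the actual signature:

```python
# Toy model of the version-based encoding-strategy selection. The string
# return values stand in for the Rust strategy types; this is a sketch of
# the dispatch logic, not the real encoder.rs code.

def default_encoding_strategy(version: str) -> str:
    if version == "V2_0":
        # Legacy path: tolerant of missing parent-validity bits.
        return "CoreFieldEncodingStrategy"
    if version in ("V2_1", "V2_2"):
        # Structural path: relies on Arrow parent validity being set.
        return "StructuralEncodingStrategy"
    raise ValueError(f"unknown data storage version: {version}")

print(default_encoding_strategy("V2_0"))  # CoreFieldEncodingStrategy
print(default_encoding_strategy("V2_2"))  # StructuralEncodingStrategy
```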
- Problem in `LanceArrowWriter`: the `StructWriter` used to write a Spark `StructType` to Arrow's `StructVector` did not explicitly mark the parent `StructVector` slot as defined/non-null before writing children. Writer file: LanceArrowWriter.scala.
- Before fix (concept): `setValue(...)` wrote children but did not mark the parent slot valid, and `setNull()` was empty.
- Other nested writers (e.g., `FixedSizeListWriter`) mark parent validity (`valueVector.setNotNull(count)` or equivalent).
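The fix concept can be modeled in plain Python. The class and method names below only mirror the ideas in `LanceArrowWriter.scala` (a real Arrow `StructVector` keeps its validity bitmap in off-heap memory); this is a sketch, not the actual Scala code:

```python
# Toy model of a struct writer whose set_value marks the parent slot valid
# before writing children -- the essence of the fix. All names are
# illustrative stand-ins for the Scala StructWriter / Arrow StructVector.

class StructVectorModel:
    def __init__(self, n: int):
        self.validity = [False] * n   # Arrow validity bits default to "null"
        self.children = [None] * n

    def set_not_null(self, i: int):
        self.validity[i] = True

class StructWriter:
    def __init__(self, vector: StructVectorModel):
        self.vector = vector

    def set_value(self, i: int, child_row: dict):
        # The fix: explicitly mark the parent slot valid before writing children.
        self.vector.set_not_null(i)
        self.vector.children[i] = child_row

    def set_null(self, i: int):
        # Leave the validity bit unset; the slot correctly reads as NULL.
        pass

vec = StructVectorModel(2)
writer = StructWriter(vec)
writer.set_value(0, {"field1": "value1", "field2": "value2"})
writer.set_null(1)
print(vec.validity)  # [True, False]
```

Without the `set_not_null` call in `set_value`, every slot would remain at its default "null" state, which is exactly what the v2.1+ structural encoders then observe.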