BUG: struct columns written in Lance format v2.1/v2.2 are always null #119

@qidian99

Problem summary

When writing Spark tables to Lance with format version V2_1 or V2_2, struct columns (and therefore their nested fields) are written as NULL. The same write succeeds for V2_0. Root cause: the Spark-side Arrow writer (LanceArrowWriter) does not set the parent StructVector validity bit when writing a non-null struct. Lance v2.1+ uses structural encoders that honor parent validity, so when the parent validity bit is not set they skip child encoding and produce NULLs.

Minimal steps to reproduce

  1. Start a Spark session with the Lance Spark connector on the classpath (use the connector built from this repo).
  2. Run (SQL or DataFrame equivalent):
-- Create temporary table with struct data using SQL
CREATE OR REPLACE TEMP VIEW test_data AS
SELECT 
    1 as id,
    struct('value1' as field1, 'value2' as field2) as value
UNION ALL
SELECT 
    2 as id,
    struct('value3' as field1, 'value4' as field2) as value;
 
-- Show the data
SELECT * FROM test_data;
 
-- Write to Parquet
CREATE TABLE parquet_table
USING parquet
LOCATION '/tmp/test_parquet'
AS SELECT * FROM test_data;
 
-- Read from Parquet
CREATE OR REPLACE TEMP VIEW parquet_data AS
SELECT * FROM parquet_table;
 
 
-- Set Lance Spark Catalog configuration
SET spark.sql.catalog.lance=com.lancedb.lance.spark.LanceNamespaceSparkCatalog;
SET spark.sql.catalog.lance.impl=dir;
SET spark.sql.catalog.lance.root=/tmp/lance/base;
SET spark.sql.catalog.lance.storage.data_storage_version=V2_0;
 
SET spark.sql.catalog.lance2=com.lancedb.lance.spark.LanceNamespaceSparkCatalog;
SET spark.sql.catalog.lance2.impl=dir;
SET spark.sql.catalog.lance2.root=/tmp/lance/test;
SET spark.sql.catalog.lance2.storage.data_storage_version=V2_2;
 
-- Drop table if exists
DROP TABLE IF EXISTS lance.default.lance_v20_test;
DROP TABLE IF EXISTS lance2.default.lance_v22_test;
 
-- Create Lance table using format V2_0 (struct values are expected to round-trip)
CREATE TABLE lance.default.lance_v20_test (
    id INT,
    value STRUCT<
        field1: STRING,
        field2: STRING
    >
) TBLPROPERTIES ('data_storage_version'='V2_0');
 
-- Insert data into Lance table
INSERT INTO lance.default.lance_v20_test
SELECT id, value FROM parquet_data;
 
-- Query the Lance table to verify
SELECT * FROM lance.default.lance_v20_test;
 
-- Create Lance table using format V2_2 (struct values come back as NULL)
CREATE TABLE lance2.default.lance_v22_test (
    id INT,
    value STRUCT<
        field1: STRING,
        field2: STRING
    >
) TBLPROPERTIES ('data_storage_version'='V2_2');
 
-- Insert data into Lance table
INSERT INTO lance2.default.lance_v22_test
SELECT id, value FROM parquet_data;
 
-- Query the Lance table to verify
SELECT * FROM lance2.default.lance_v22_test;

Observed:

  • V2_0 writes show value with nested fields (non-NULL).
  • V2_1/V2_2 writes show value as NULL.

Root cause

  • Structural encoders for v2.1+ check the parent's validity bitmap to decide whether to encode child fields. If the parent is considered null (via missing/false validity bit), child fields will not be encoded (reader sees NULL).
  • The legacy v2.0 encoders encode nested children differently and tolerate a missing parent validity bit, which is why v2.0 appears to work.
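
To make the difference between the two encoder families concrete, here is a minimal toy model in plain Python. It is not the Lance or Arrow API: ToyStructColumn, encode_structural, encode_legacy and toy_encode are hypothetical names, and the assumption that the legacy path ignores the parent validity bit entirely is a simplification of "tolerant" as described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyStructColumn:
    """Hypothetical stand-in for an Arrow struct array:
    one parent validity bit per row plus one value list per child field."""
    validity: List[bool]
    children: Dict[str, List[Optional[str]]]


def _row(col, i):
    # Collect the child values that belong to row i.
    return {name: values[i] for name, values in col.children.items()}


def encode_structural(col):
    # Toy v2.1+ behavior: children are only kept when the parent bit is set.
    return [_row(col, i) if valid else None for i, valid in enumerate(col.validity)]


def encode_legacy(col):
    # Toy v2.0 behavior (assumption): parent validity is not consulted.
    return [_row(col, i) for i in range(len(col.validity))]


def toy_encode(version, col):
    # Hypothetical dispatcher mirroring the version split described above.
    if version == "V2_0":
        return encode_legacy(col)
    return encode_structural(col)


children = {"field1": ["value1", "value3"], "field2": ["value2", "value4"]}
# What the writer produces today: children written, parent bits never set.
unmarked = ToyStructColumn(validity=[False, False], children=children)
# What a writer that marks the parent as defined would produce.
marked = ToyStructColumn(validity=[True, True], children=children)

print(toy_encode("V2_0", unmarked))
print(toy_encode("V2_2", unmarked))
print(toy_encode("V2_2", marked))
```

In this model the same unmarked column reads back intact under V2_0 and as all NULLs under V2_2, matching the observed behavior, while the marked column reads back intact under both.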

Some Details

  • Spark writes produce an Arrow VectorSchemaRoot and export it via the Arrow C Data Interface to native code. The chain is:
    • LanceDataWriter → LanceArrowWriter (JVM/Scala)
    • ArrowArrayStream export → JNI (fragment.rs ArrowArrayStreamReader)
    • Rust: FileFragment::create_fragments → write pipeline → v2::writer::FileWriter (format-specific encoding)
  • Lance file format encoding selection logic:
    • rust/lance-encoding/src/encoder.rs::default_encoding_strategy(version):
      • V2_0 → legacy/CoreFieldEncodingStrategy (more tolerant)
      • V2_1+ → StructuralEncodingStrategy (structural encoders that rely on Arrow parent validity)
    • writer.rs uses per-column encoders chosen by the above strategy.
  • Problem in LanceArrowWriter:
    • The StructWriter that writes Spark StructType values into Arrow's StructVector did not explicitly mark the parent StructVector slot as defined/non-null before writing children.
    • Example of the writer area (file): LanceArrowWriter.scala
      • Before fix (concept): setValue(...) wrote children but did not call any parent-valid marking; setNull() was empty.
      • Other nested writers (e.g., FixedSizeListWriter) mark parent validity (valueVector.setNotNull(count) or equivalent).
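
The writer-side problem and the shape of a fix can be sketched with another toy model in plain Python. ToyStructVector and ToyStructWriter are hypothetical and only imitate the behavior described above; the real code lives in LanceArrowWriter.scala. In Arrow Java, StructVector exposes setIndexDefined(index) and setNull(index) for marking a slot, but whether the actual fix uses exactly those calls is not stated here. The sketch only models the non-null path; null rows are marked null in both modes.

```python
class ToyStructVector:
    """Hypothetical stand-in for Arrow's StructVector."""

    def __init__(self, field_names):
        self.validity = []  # one bit per row; False means the struct is null
        self.children = {name: [] for name in field_names}

    def _grow(self, index):
        # Extend all buffers so that `index` is addressable; new slots are null.
        while len(self.validity) <= index:
            self.validity.append(False)
            for values in self.children.values():
                values.append(None)

    def set_index_defined(self, index):
        self._grow(index)
        self.validity[index] = True

    def set_null(self, index):
        self._grow(index)
        self.validity[index] = False

    def set_child(self, index, name, value):
        self._grow(index)
        self.children[name][index] = value


class ToyStructWriter:
    """Writes rows into a ToyStructVector. With mark_parent_valid=False it
    imitates the behavior reported in this issue."""

    def __init__(self, vector, mark_parent_valid):
        self.vector = vector
        self.mark_parent_valid = mark_parent_valid
        self.count = 0

    def write(self, row):
        if row is None:
            self.vector.set_null(self.count)
        else:
            if self.mark_parent_valid:
                # The missing step: mark the parent slot as non-null first.
                self.vector.set_index_defined(self.count)
            for name, value in row.items():
                self.vector.set_child(self.count, name, value)
        self.count += 1


rows = [
    {"field1": "value1", "field2": "value2"},
    None,
    {"field1": "value3", "field2": "value4"},
]

buggy = ToyStructVector(["field1", "field2"])
fixed = ToyStructVector(["field1", "field2"])
buggy_writer = ToyStructWriter(buggy, mark_parent_valid=False)
fixed_writer = ToyStructWriter(fixed, mark_parent_valid=True)
for r in rows:
    buggy_writer.write(r)
    fixed_writer.write(r)

print(buggy.validity, fixed.validity)
```

Both vectors end up with identical child data; only the parent validity bits differ, which is exactly the part a parent-validity-aware encoder looks at.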
