BUG: struct columns written for Lance format v2.1/v2.2 are always null

## Problem summary
When writing Spark tables to Lance with format version `V2_1` or `V2_2`, struct columns (and therefore their nested fields) are written as NULL. The same write succeeds for `V2_0`. Root cause: the Spark-side Arrow writer (`LanceArrowWriter`) does not set the parent `StructVector` validity bit when writing a non-null struct. Lance v2.1+ uses structural encoders that honor parent validity, so with the parent bit unset they skip child encoding and the column reads back as NULL.

## Minimal Steps to reproduce
1. Start a Spark session with the Lance Spark connector on the classpath (use the connector built from this repo).
2. Run (SQL or DataFrame equivalent):
```sql
-- Create temporary table with struct data using SQL
CREATE OR REPLACE TEMP VIEW test_data AS
SELECT 
    1 as id,
    struct('value1' as field1, 'value2' as field2) as value
UNION ALL
SELECT 
    2 as id,
    struct('value3' as field1, 'value4' as field2) as value;
 
-- Show the data
SELECT * FROM test_data;
 
-- Write to Parquet
CREATE TABLE parquet_table
USING parquet
LOCATION '/tmp/test_parquet'
AS SELECT * FROM test_data;
 
-- Read from Parquet
CREATE OR REPLACE TEMP VIEW parquet_data AS
SELECT * FROM parquet_table;
 
 
-- Set Lance Spark Catalog configuration
SET spark.sql.catalog.lance=com.lancedb.lance.spark.LanceNamespaceSparkCatalog;
SET spark.sql.catalog.lance.impl=dir;
SET spark.sql.catalog.lance.root=/tmp/lance/base;
SET spark.sql.catalog.lance.storage.data_storage_version=V2_0;
 
SET spark.sql.catalog.lance2=com.lancedb.lance.spark.LanceNamespaceSparkCatalog;
SET spark.sql.catalog.lance2.impl=dir;
SET spark.sql.catalog.lance2.root=/tmp/lance/test;
SET spark.sql.catalog.lance2.storage.data_storage_version=V2_2;
 
-- Drop table if exists
DROP TABLE IF EXISTS lance.default.lance_v20_test;
DROP TABLE IF EXISTS lance2.default.lance_v22_test;
 
-- Create the Lance table with format V2_0
CREATE TABLE lance.default.lance_v20_test (
    id INT,
    value STRUCT<
        field1: STRING,
        field2: STRING
    >
) TBLPROPERTIES ('data_storage_version'='V2_0');
 
-- Insert data into Lance table
INSERT INTO lance.default.lance_v20_test
SELECT id, value FROM parquet_data;
 
-- Query the Lance table to verify
SELECT * FROM lance.default.lance_v20_test;
 
-- Create the Lance table with format V2_2
CREATE TABLE lance2.default.lance_v22_test (
    id INT,
    value STRUCT<
        field1: STRING,
        field2: STRING
    >
) TBLPROPERTIES ('data_storage_version'='V2_2');
 
-- Insert data into Lance table
INSERT INTO lance2.default.lance_v22_test
SELECT id, value FROM parquet_data;
 
-- Query the Lance table to verify
SELECT * FROM lance2.default.lance_v22_test;
```
Observed:
- `V2_0`: `value` is read back with its nested fields (non-NULL).
- `V2_1`/`V2_2`: `value` is read back as NULL.

## Root cause
- Structural encoders for v2.1+ check the parent's validity bitmap to decide whether to encode child fields. If the parent's validity bit is unset (missing or false), the parent is treated as null and its child fields are not encoded, so the reader sees NULL.
- The v2.0 legacy encoders encode nested children differently and tolerate an unset parent validity bit, which is why v2.0 appears to work.
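
The difference between the two behaviours can be illustrated with a toy model in plain Python. This is only a sketch, not Lance's encoder code: `ToyStructColumn` and both functions are hypothetical names, and modelling the v2.0 tolerance as "ignore the parent bit" is an assumption made for illustration.

```python
from typing import Dict, List, Optional


class ToyStructColumn:
    """Toy stand-in for an Arrow struct column (hypothetical, not Arrow's API).

    validity holds one flag per row for the parent struct;
    children holds one list of values per child field.
    """

    def __init__(self, validity: List[bool],
                 children: Dict[str, List[Optional[str]]]):
        self.validity = validity
        self.children = children


def rows_by_parent_validity(col: ToyStructColumn):
    """Structural-style view: a row is null whenever the parent flag is unset."""
    out = []
    for i, valid in enumerate(col.validity):
        if not valid:
            out.append(None)  # child values are never consulted
        else:
            out.append({name: vals[i] for name, vals in col.children.items()})
    return out


def rows_ignoring_parent_validity(col: ToyStructColumn):
    """Tolerant view (assumed v2.0-like): children are emitted regardless."""
    return [
        {name: vals[i] for name, vals in col.children.items()}
        for i in range(len(col.validity))
    ]


# Children are populated, but the parent flags were never set (the bug).
buggy = ToyStructColumn(
    validity=[False, False],
    children={"field1": ["value1", "value3"], "field2": ["value2", "value4"]},
)
# Same children with the parent flags set, as a correct writer would leave them.
fixed = ToyStructColumn(validity=[True, True], children=buggy.children)

print(rows_by_parent_validity(buggy))
print(rows_ignoring_parent_validity(buggy))
print(rows_by_parent_validity(fixed))
```

In this model the child data is identical in both columns; only the parent flags decide whether the validity-honoring view returns the structs or NULLs.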


## Some Details

- Spark writes produce an Arrow `VectorSchemaRoot` and export it to native code via the Arrow C Data Interface. The chain is:
  - `LanceDataWriter` → `LanceArrowWriter` (JVM/Scala)
  - `ArrowArrayStream` export → JNI (`fragment.rs`, `ArrowArrayStreamReader`)
  - Rust: `FileFragment::create_fragments` → write pipeline → `v2::writer::FileWriter` (format-specific encoding)
- Lance file format encoding selection logic:
  - `rust/lance-encoding/src/encoder.rs::default_encoding_strategy(version)`:
    - `V2_0` → legacy `CoreFieldEncodingStrategy` (more tolerant)
    - `V2_1+` → `StructuralEncodingStrategy` (structural encoders that rely on Arrow parent validity)
  - `writer.rs` uses per-column encoders chosen by the above strategy.
- Problem in `LanceArrowWriter`:
  - `StructWriter`, which writes Spark `StructType` values into Arrow's `StructVector`, did not explicitly mark the parent `StructVector` slot as defined (non-null) before writing children.
  - Relevant file: `LanceArrowWriter.scala`
    - Before the fix (conceptually): `setValue(...)` wrote the children but never marked the parent slot as valid, and `setNull()` was empty.
    - Other nested writers (e.g., `FixedSizeListWriter`) do mark parent validity (`valueVector.setNotNull(count)` or equivalent).
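
The writer-side change described above can be sketched with a small stdlib-only Python model. All class names here are hypothetical stand-ins, not the connector's Scala code. The model assumes a freshly allocated vector starts with every validity bit unset. In Arrow's Java API the real call would presumably be something like `StructVector.setIndexDefined(index)`, but the issue does not name the method, so treat that as an assumption.

```python
from typing import Dict, List, Optional

Row = Optional[Dict[str, str]]


class ToyStructVector:
    """Stand-in for a struct vector: parent validity flags plus child columns."""

    def __init__(self, field_names: List[str], capacity: int):
        # Assumption: validity starts unset, so an untouched slot reads as null.
        self.validity = [False] * capacity
        self.children = {name: [None] * capacity for name in field_names}


class StructWriterBeforeFix:
    """Writes children only; the parent slot is never marked as defined."""

    def __init__(self, vector: ToyStructVector):
        self.vector = vector
        self.count = 0

    def set_value(self, row: Dict[str, str]) -> None:
        for name, value in row.items():
            self.vector.children[name][self.count] = value

    def set_null(self) -> None:
        pass  # empty, as described for the pre-fix writer

    def write(self, row: Row) -> None:
        if row is None:
            self.set_null()
        else:
            self.set_value(row)
        self.count += 1


class StructWriterAfterFix(StructWriterBeforeFix):
    """Marks the parent slot before writing children; clears it for nulls."""

    def set_value(self, row: Dict[str, str]) -> None:
        self.vector.validity[self.count] = True
        super().set_value(row)

    def set_null(self) -> None:
        self.vector.validity[self.count] = False


rows: List[Row] = [
    {"field1": "value1", "field2": "value2"},
    None,
    {"field1": "value3", "field2": "value4"},
]

before = ToyStructVector(["field1", "field2"], len(rows))
after = ToyStructVector(["field1", "field2"], len(rows))
writer_before = StructWriterBeforeFix(before)
writer_after = StructWriterAfterFix(after)
for r in rows:
    writer_before.write(r)
    writer_after.write(r)

print(before.validity)
print(after.validity)
```

Both writers produce the same child columns; only the second leaves the parent flags in a state that distinguishes real structs from null ones.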
