Commit 37ee992
[SPARK-53535][SQL] Fix missing structs always being assumed as nulls
### What changes were proposed in this pull request?
Currently, if all fields of a struct mentioned in the read schema are missing in a Parquet file, the reader populates the struct with nulls.
This PR modifies the scan behavior so that if the struct exists in the Parquet schema but none of the fields from the read schema are present, we instead pick an arbitrary field from the Parquet file to read and use that to populate NULLs (as well as outer NULLs and array sizes if the struct is nested in another nested type).
This is done by changing the schema requested by the readers. We add an additional field to the requested schema when clipping the Parquet file schema according to the Spark schema. This means that the readers actually read and return more data than requested, which can cause problems. This is only a problem for the `VectorizedParquetRecordReader`, since for the other read code path via parquet-mr, we already have an `UnsafeProjection` for outputting only requested schema fields in `ParquetFileFormat`.
To ensure `VectorizedParquetRecordReader` only returns Spark requested fields, we create the `ColumnarBatch` with vectors that match the requested schema (we get rid of the additional fields by recursively matching `sparkSchema` with `sparkRequestedSchema` and ensuring structs have the same length in both). Then `ParquetColumnVector`s are responsible for allocating dummy vectors to hold the data temporarily while reading, but these are not exposed to the outside.
The heuristic to pick the arbitrary leaf field is as follows: We try to minimize the amount of arrays or maps (repeated fields) in the path to a leaf column, because the more repeated fields we have the more likely we are to read larger amount of data. At the same repetition level, we consider the type of each column to pick the cheapest column to read (struct nesting do not affect the decision here). We look at the byte size of the column type to pick the cheapest one as follows:
- BOOLEAN: 1 byte
- INT32, FLOAT: 4 bytes
- INT64, DOUBLE: 8 bytes
- INT96: 12 bytes
- BINARY, FIXED_LEN_BYTE_ARRAY, default case for future types: 32 bytes (high cost due to variable/large size)
### Why are the changes needed?
This is a bug fix, because we were incorrectly assuming non-null struct values to be missing from the file depending on requested fields and returning null values.
### Does this PR introduce _any_ user-facing change?
Yes. We previously assumed structs to be null if all the fields we are trying to read from a Parquet file were missing from that file, even if the file contained other fields that could be used to take definition levels from. See an example from the Jira ticket below:
```python
df_a = sql('SELECT 1 as id, named_struct("a", 1) AS s')
path = "/tmp/missing_col_test"
df_a.write.format("parquet").save(path)
df_b = sql('SELECT 2 as id, named_struct("b", 3) AS s')
spark.read.format("parquet").schema(df_b.schema).load(path).show()
```
This used to return:
```
+---+----+
| id| s|
+---+----+
| 1|NULL|
+---+----+
```
It now returns:
```
+---+------+
| id| s|
+---+------+
| 1|{NULL}|
+---+------+
```
### How was this patch tested?
Added new unit tests, also fixed an old test to expect this new behavior.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #52557 from ZiyaZa/missing_struct.
Authored-by: Ziya Mukhtarov <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>1 parent 1f21a8b commit 37ee992
File tree
15 files changed
+898
-121
lines changed- docs
- sql
- catalyst/src/main/scala/org/apache/spark/sql/internal
- core/src
- main
- java/org/apache/spark/sql/execution
- datasources/parquet
- vectorized
- scala/org/apache/spark/sql/execution/datasources/parquet
- test/scala/org/apache/spark/sql/execution/datasources/parquet
15 files changed
+898
-121
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
22 | 22 | | |
23 | 23 | | |
24 | 24 | | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
25 | 29 | | |
26 | 30 | | |
27 | 31 | | |
| |||
Lines changed: 12 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1534 | 1534 | | |
1535 | 1535 | | |
1536 | 1536 | | |
| 1537 | + | |
| 1538 | + | |
| 1539 | + | |
| 1540 | + | |
| 1541 | + | |
| 1542 | + | |
| 1543 | + | |
| 1544 | + | |
| 1545 | + | |
| 1546 | + | |
| 1547 | + | |
| 1548 | + | |
1537 | 1549 | | |
1538 | 1550 | | |
1539 | 1551 | | |
| |||
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnVector.java
Lines changed: 18 additions & 30 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
21 | 21 | | |
22 | 22 | | |
23 | 23 | | |
24 | | - | |
25 | 24 | | |
26 | | - | |
27 | | - | |
28 | 25 | | |
29 | 26 | | |
30 | 27 | | |
31 | | - | |
32 | 28 | | |
33 | 29 | | |
34 | 30 | | |
| |||
69 | 65 | | |
70 | 66 | | |
71 | 67 | | |
72 | | - | |
73 | 68 | | |
74 | 69 | | |
75 | 70 | | |
76 | | - | |
77 | | - | |
78 | | - | |
79 | | - | |
80 | | - | |
81 | | - | |
82 | 71 | | |
83 | 72 | | |
84 | 73 | | |
| |||
111 | 100 | | |
112 | 101 | | |
113 | 102 | | |
114 | | - | |
115 | | - | |
116 | | - | |
117 | | - | |
118 | | - | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
119 | 107 | | |
120 | 108 | | |
121 | 109 | | |
122 | 110 | | |
123 | 111 | | |
124 | 112 | | |
125 | 113 | | |
126 | | - | |
| 114 | + | |
127 | 115 | | |
128 | 116 | | |
129 | 117 | | |
130 | | - | |
| 118 | + | |
131 | 119 | | |
132 | 120 | | |
133 | | - | |
134 | | - | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
135 | 127 | | |
136 | 128 | | |
137 | 129 | | |
138 | | - | |
139 | | - | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
140 | 136 | | |
141 | 137 | | |
142 | | - | |
143 | 138 | | |
144 | 139 | | |
145 | 140 | | |
| |||
375 | 370 | | |
376 | 371 | | |
377 | 372 | | |
378 | | - | |
379 | | - | |
380 | | - | |
381 | | - | |
382 | | - | |
383 | | - | |
384 | | - | |
385 | 373 | | |
386 | 374 | | |
387 | 375 | | |
| |||
Lines changed: 22 additions & 18 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
87 | 87 | | |
88 | 88 | | |
89 | 89 | | |
| 90 | + | |
| 91 | + | |
90 | 92 | | |
91 | 93 | | |
92 | 94 | | |
| |||
99 | 101 | | |
100 | 102 | | |
101 | 103 | | |
102 | | - | |
| 104 | + | |
103 | 105 | | |
104 | 106 | | |
105 | 107 | | |
| |||
164 | 166 | | |
165 | 167 | | |
166 | 168 | | |
167 | | - | |
168 | | - | |
169 | | - | |
170 | | - | |
171 | | - | |
172 | | - | |
| 169 | + | |
| 170 | + | |
| 171 | + | |
| 172 | + | |
| 173 | + | |
| 174 | + | |
173 | 175 | | |
174 | 176 | | |
175 | | - | |
| 177 | + | |
176 | 178 | | |
177 | 179 | | |
178 | | - | |
| 180 | + | |
179 | 181 | | |
180 | 182 | | |
181 | 183 | | |
182 | | - | |
| 184 | + | |
183 | 185 | | |
184 | 186 | | |
185 | 187 | | |
| |||
201 | 203 | | |
202 | 204 | | |
203 | 205 | | |
204 | | - | |
| 206 | + | |
205 | 207 | | |
206 | 208 | | |
| 209 | + | |
207 | 210 | | |
208 | 211 | | |
209 | 212 | | |
| |||
216 | 219 | | |
217 | 220 | | |
218 | 221 | | |
219 | | - | |
220 | | - | |
221 | | - | |
222 | | - | |
223 | | - | |
224 | | - | |
225 | | - | |
| 222 | + | |
| 223 | + | |
| 224 | + | |
| 225 | + | |
| 226 | + | |
| 227 | + | |
| 228 | + | |
226 | 229 | | |
227 | 230 | | |
| 231 | + | |
228 | 232 | | |
229 | 233 | | |
230 | 234 | | |
| |||
Lines changed: 65 additions & 4 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
48 | 48 | | |
49 | 49 | | |
50 | 50 | | |
| 51 | + | |
| 52 | + | |
51 | 53 | | |
52 | 54 | | |
53 | | - | |
54 | | - | |
55 | 55 | | |
56 | 56 | | |
57 | 57 | | |
| |||
265 | 265 | | |
266 | 266 | | |
267 | 267 | | |
268 | | - | |
| 268 | + | |
| 269 | + | |
| 270 | + | |
| 271 | + | |
| 272 | + | |
| 273 | + | |
| 274 | + | |
| 275 | + | |
| 276 | + | |
269 | 277 | | |
270 | 278 | | |
271 | 279 | | |
| |||
287 | 295 | | |
288 | 296 | | |
289 | 297 | | |
290 | | - | |
| 298 | + | |
| 299 | + | |
291 | 300 | | |
292 | 301 | | |
293 | 302 | | |
| |||
309 | 318 | | |
310 | 319 | | |
311 | 320 | | |
| 321 | + | |
| 322 | + | |
| 323 | + | |
| 324 | + | |
| 325 | + | |
| 326 | + | |
| 327 | + | |
| 328 | + | |
| 329 | + | |
| 330 | + | |
| 331 | + | |
| 332 | + | |
| 333 | + | |
| 334 | + | |
| 335 | + | |
| 336 | + | |
| 337 | + | |
| 338 | + | |
| 339 | + | |
| 340 | + | |
| 341 | + | |
| 342 | + | |
| 343 | + | |
| 344 | + | |
| 345 | + | |
| 346 | + | |
| 347 | + | |
| 348 | + | |
| 349 | + | |
| 350 | + | |
| 351 | + | |
| 352 | + | |
| 353 | + | |
| 354 | + | |
| 355 | + | |
| 356 | + | |
| 357 | + | |
| 358 | + | |
| 359 | + | |
| 360 | + | |
| 361 | + | |
| 362 | + | |
| 363 | + | |
| 364 | + | |
| 365 | + | |
| 366 | + | |
| 367 | + | |
| 368 | + | |
| 369 | + | |
| 370 | + | |
| 371 | + | |
| 372 | + | |
312 | 373 | | |
313 | 374 | | |
314 | 375 | | |
| |||
Lines changed: 1 addition & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
633 | 633 | | |
634 | 634 | | |
635 | 635 | | |
636 | | - | |
| 636 | + | |
637 | 637 | | |
638 | 638 | | |
639 | 639 | | |
Lines changed: 1 addition & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
646 | 646 | | |
647 | 647 | | |
648 | 648 | | |
649 | | - | |
| 649 | + | |
650 | 650 | | |
651 | 651 | | |
652 | 652 | | |
Lines changed: 1 addition & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
944 | 944 | | |
945 | 945 | | |
946 | 946 | | |
947 | | - | |
| 947 | + | |
948 | 948 | | |
949 | 949 | | |
950 | 950 | | |
| |||
0 commit comments