Conversation

@Tmonster (Owner)

Needed for reordering left joins

taniabogatsch and others added 30 commits January 20, 2025 18:19
# Conflicts:
#	src/execution/operator/persistent/physical_insert.cpp
* For small probe cardinalities, plan a nested loop join + aggregate.
* Add an `asof_loop_join_threshold` setting to control its use.
* Loop over `asof_loop_join_threshold` values in tests.
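A hedged sketch of how the new setting might be exercised; the table names and the threshold value are illustrative, not defaults from this PR:

```sql
-- Illustrative only: steer small probe sides towards the nested loop plan.
SET asof_loop_join_threshold = 64;  -- hypothetical value

SELECT t.ts, t.symbol, p.price
FROM trades t                       -- hypothetical tables
ASOF JOIN prices p
  ON t.symbol = p.symbol AND t.ts >= p.ts;
```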
Mytherin and others added 30 commits February 17, 2025 22:11
…or verification (duckdb#16262)

This PR fixes duckdblabs/duckdb-internal#4233

`Execute` complained because the validity mask was dirty; we now use a vector cache to reset the `Vector` before every `Execute` call.
…er (duckdb#16263)

Currently we keep the compressed buffer around per column, even though we really only need it for a short time (while reading data to then decompress it). These resizable buffers can grow to a rather substantial amount of memory, which can lead to us using 50-100% extra memory when reading Parquet files. Destroying and re-allocating the buffer does not seem to greatly affect performance - below are ClickBench timings run on my laptop on a single big file (which is when caching matters the most, given that caches are re-instantiated between files anyway):

We can perhaps do something more clever where we keep around a single
cache that we share across readers, but not for the bug-fix release.

| Query |  Old  |  New  | Ratio |
|-------|------:|------:|------:|
| Q00   | 0.165 | 0.156 | 0.95  |
| Q01   | 0.156 | 0.155 | 0.99  |
| Q02   | 0.192 | 0.191 | 0.99  |
| Q03   | 0.179 | 0.188 | 1.05  |
| Q04   | 0.292 | 0.287 | 0.98  |
| Q05   | 0.364 | 0.349 | 0.96  |
| Q06   | 0.154 | 0.157 | 1.02  |
| Q07   | 0.158 | 0.159 | 1.01  |
| Q08   | 0.361 | 0.348 | 0.96  |
| Q09   | 0.449 | 0.440 | 0.98  |
| Q10   | 0.225 | 0.220 | 0.98  |
| Q11   | 0.251 | 0.249 | 0.99  |
| Q12   | 0.379 | 0.355 | 0.94  |
| Q13   | 0.552 | 0.539 | 0.98  |
| Q14   | 0.426 | 0.412 | 0.97  |
| Q15   | 0.351 | 0.325 | 0.93  |
| Q16   | 0.665 | 0.629 | 0.95  |
| Q17   | 0.617 | 0.596 | 0.97  |
| Q18   | 0.971 | 0.947 | 0.98  |
| Q19   | 0.215 | 0.196 | 0.91  |
| Q20   | 0.929 | 0.892 | 0.96  |
| Q21   | 0.730 | 0.719 | 0.98  |
| Q22   | 1.213 | 1.226 | 1.01  |
| Q23   | 3.062 | 2.905 | 0.95  |
| Q24   | 0.189 | 0.189 | 1.00  |
| Q25   | 0.178 | 0.172 | 0.97  |
| Q26   | 0.201 | 0.186 | 0.93  |
| Q27   | 0.771 | 0.729 | 0.95  |
| Q28   | 7.475 | 7.246 | 0.97  |
| Q29   | 0.254 | 0.236 | 0.93  |
| Q30   | 0.483 | 0.477 | 0.99  |
| Q31   | 0.508 | 0.549 | 1.08  |
| Q32   | 1.168 | 1.254 | 1.07  |
| Q33   | 1.431 | 1.493 | 1.04  |
| Q34   | 1.453 | 1.511 | 1.04  |
| Q35   | 0.359 | 0.369 | 1.03  |
| Q36   | 0.165 | 0.184 | 1.12  |
| Q37   | 0.161 | 0.166 | 1.03  |
| Q38   | 0.173 | 0.172 | 0.99  |
| Q39   | 0.177 | 0.182 | 1.03  |
| Q40   | 0.165 | 0.169 | 1.02  |
| Q41   | 0.162 | 0.172 | 1.06  |
…n the compiler (MSVC or not) (duckdb#16267)

This PR addresses duckdb#16237

My local builds started failing because `-std=c++11` was missing; this should fix the issue for both compiler flavors.
Please verify that this doesn't break the MSVC environment @cfis
Output from the `duckdb_extensions` table function would show `NULL` as the install mode instead of `NOT_INSTALLED`.
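For reference, the column in question can be inspected with a plain query against the built-in table function:

```sql
-- Inspect the install mode reported for each extension.
SELECT extension_name, installed, install_mode
FROM duckdb_extensions();
```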
…nemap (duckdb#16269)

During row group checks, we keep track of which filters are always true (and hence do not need to be checked during the actual scan). Due to a missing `else` statement, we could call `SetFilterAlwaysTrue` twice for the same filter.

At first glance this doesn't look problematic, but `SetFilterAlwaysTrue` *also* keeps track of how many filters have been marked as always true - and if all filters are marked as always true, the filters are skipped entirely.

This can cause problems when optional filters (e.g., constructed due to a Top-N) are combined with non-optional filters (e.g., constructed through a `WHERE` clause). The filter would not be correctly executed - leading to incorrect rows sneaking into the result.
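A hedged sketch of the query shape involved (table and column names are hypothetical): the Top-N contributes an optional filter to the scan, while the `WHERE` clause contributes a non-optional one.

```sql
-- Hypothetical repro shape: an optional Top-N filter combined with a
-- non-optional WHERE filter on the same scan.
SELECT *
FROM measurements          -- hypothetical table
WHERE sensor_id = 42       -- non-optional filter from the WHERE clause
ORDER BY reading
LIMIT 5;                   -- Top-N can push an optional filter into the scan
```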
* Check hints for equality and skip search.

fixes: duckdb#16250
fixes: duckdblabs/duckdb-internal#4229
Fix duckdb#16257

I've attached the diff that needs to happen to `main` when we eventually
need to merge this branch into it, as the Parquet writer has been
refactored significantly in the meantime.


[page_fix.txt](https://github.com/user-attachments/files/18827335/page_fix.txt)
Basically cherry-picking the commits of
duckdb#16154 to the `v1.2` branch, and
reducing the number of tests to a minimal set.
1. Slightly improve the performance of hashing long strings
2. Implement bit-packing for our `RleBpEncoder` (it only did RLE until
now)
3. Implement a custom hash map to build Parquet dictionaries (instead of
using `std::unordered_map`)
4. Add many fast paths, e.g., for handling `NULL` values when we know
there are no `NULL`s

This brings the time it takes to write `lineitem` from TPC-H SF10 to
Parquet from ~4.5s to ~2.5s on my laptop, as well as reducing the size
from 2.4 GB to 2.0 GB (with the default `PARQUET_VERSION V1`).

When enabling `PARQUET_VERSION V2`, time reduces from ~3.8s to ~2.4s on
my laptop, and size reduces from 1.7 GB to 1.6 GB.

It also adds the parameter `string_dictionary_page_size_limit` to the
Parquet writer, which makes the page size of dictionary pages
configurable. It defaults to 1 MB.
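A hedged usage sketch; the option spellings follow the names in this PR, and the byte value is arbitrary:

```sql
-- Illustrative only: write with Parquet V2 data pages and a larger
-- dictionary page size limit (the default is 1 MB per this PR).
COPY lineitem TO 'lineitem.parquet' (
    FORMAT PARQUET,
    PARQUET_VERSION V2,
    STRING_DICTIONARY_PAGE_SIZE_LIMIT 2097152  -- 2 MiB; value format assumed
);
```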
* Use two cursors for range searches
* Reduces benchmark time from 35s to 25s
These changes fix bugs related to optimistically writing blocks via the partial block manager.

- `LocalTableStorage::Rollback` has to `row_groups->CommitDropTable()`;
otherwise, we'll have no way to figure out which blocks have to be
marked modified on rollback if the partial block manager is already
destroyed.
- `PartialBlockManager::Rollback(const bool mark_modified);`: The
top-level optimistic writer now directly rolls back its row groups.
Child writers (and their partial block managers) need to mark their
written blocks as modified.
- `LocalTableStorage` + `ALTER TYPE`: `row_groups->CommitDropColumn` the
(old) altered column to add its optimistically written blocks to the
modified list.
- `LocalTableStorage` + `DROP COLUMN`: `row_groups->CommitDropColumn`
the dropped column to add its optimistically written blocks to the
modified list.
- `PartialBlockManager::FlushPartialBlocks` needs to add its partial
flushed blocks to `written_blocks`. Otherwise, in some rare cases
(successful insert, constraint violation on other data, then rollback in
a transaction), we cannot know that a written block must be marked as
modified. That situation is also why we keep the written blocks alive
(and thus, the entire partial block manager) in
`OptimisticDataWriter::FinalFlush`.

The tests cover all these cases, and often a combination of these fixes is necessary to make a test pass.

`insert_order_preserving_odd_sized_batches.test_slow` has a `FIXME`, as I am unsure whether we expect the different block counts.

Fixes duckdblabs/duckdb-internal#2453.
Fixes duckdblabs/duckdb-internal#3931.
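A hedged sketch of the transaction shape from the `FlushPartialBlocks` bullet above; the schema is hypothetical, and the large insert stands in for data that gets optimistically written:

```sql
-- Illustrative only: successful insert, constraint violation on other
-- data, then rollback within one transaction.
CREATE TABLE t (i INTEGER PRIMARY KEY);
INSERT INTO t VALUES (1);

BEGIN TRANSACTION;
INSERT INTO t SELECT range + 10 FROM range(1000000);  -- succeeds, may be written optimistically
INSERT INTO t VALUES (1);                             -- constraint violation on other data
ROLLBACK;                                             -- written blocks must be marked modified
```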
…ckdb#16246)

Instead of allocating a `std::vector` per group, we now use arena
allocations to allocate fixed-size arrays. This greatly reduces
contention on the system allocator when there are many threads/groups.
It also fixes the scaling issues encountered here:
https://duckdb.org/2024/10/25/topn.html

I also removed an error that was thrown when a `NULL` was encountered while creating an `ENUM`. This was unnecessary, as we can simply skip such values.
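A hedged example of the `ENUM` change (hypothetical table; this assumes DuckDB's `CREATE TYPE ... AS ENUM (SELECT ...)` form):

```sql
-- NULL values in the source are now skipped instead of raising an error.
CREATE TABLE colors (name VARCHAR);
INSERT INTO colors VALUES ('red'), (NULL), ('blue');
CREATE TYPE color AS ENUM (SELECT name FROM colors);
```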
Follow-up of duckdb#16243. Scraping the bottom of the barrel here, as the
previous PR got many of the biggest performance gains already.

This PR adds some more fast paths for when there are no `NULL`s, and
implements a branchless hash function for `string_t`'s that are inlined.
This required some extra care to make sure that the hash function
returns the same value whether the string is inlined or not.

Overall, the changes reduce the time it takes to write TPC-H SF10
`lineitem` to Parquet from ~2.6s to ~2.4s (with the default
`PARQUET_VERSION V1`, ~2.5s to ~2.3s with `V2`).
…nd align behavior (duckdb#15944)

This PR fixes
<duckdblabs/duckdb-internal#3945>

The following behavior applies to casting from `VARCHAR` ->
`LIST`|`STRUCT`|`MAP`

- Backslashes outside of quotes are treated as literals, unless they are
followed by a single/double quote, in which case they act as an escape
character and are removed from the result.
- Backslashes inside of quotes are always treated as escape characters;
to use a literal backslash inside quotes, the backslash has to be
escaped (`\\`). Unescaped backslashes are removed from the result.

Quotes that are used to start and end a quoted value are removed from
the result.
`[""""]` -> `[]`
`['''']` -> `[]`
`['"']` -> `["]`
`["'"]` -> `[']`

Whenever your string contains a character that is considered special by
the casting logic, it should be wrapped in quotes.
To preserve leading or trailing spaces, the value should also be wrapped
in quotes.
The list of characters that are *always*[1] special: `(whitespace)`,
`{`, `}`, `(`, `)`, `[`, `]`, `"`, `'`, `\`
Other characters that are special in certain contexts:
- STRUCT key: `:`
- MAP key: `=`
- LIST, MAP, STRUCT value: `,`

[1] STRUCT key permits the use of `{`, `}`, `(`, `)`, `[`, `]`
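A few hedged examples consistent with the rules above; the expected results assume the behavior described in this PR:

```sql
-- Comma inside quotes is literal; the surrounding quotes are removed:
-- yields the two elements a and b,c.
SELECT CAST('[a, "b,c"]' AS VARCHAR[]);

-- Backslash before a quote acts as an escape outside of quotes:
-- yields a single element containing only ".
SELECT CAST('[\"]' AS VARCHAR[]);

-- MAP keys use = as the separator.
SELECT CAST('{a=1, b=2}' AS MAP(VARCHAR, INTEGER));
```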

------------------------------------------------------------

This PR also adds support for casting from `VARCHAR` -> unnamed `STRUCT`, in the form of `( <value>, <value2>, ..., <value n> )`.
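A hedged guess at the usage, assuming the values are matched positionally against the fields of the target type:

```sql
-- Illustrative only: the input string uses the unnamed (key-less) form;
-- positional matching against the target struct is assumed.
SELECT CAST('(1, two, 3.0)' AS STRUCT(a INTEGER, b VARCHAR, c DOUBLE));
```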
Although I haven't run benchmarks, the release binary size has been
reduced by ~35KB, which should result in some performance improvements.