Currently, DuckDB's indexes have the following limitations and issues:
Over-eager uniqueness constraint checking:
DuckDB does not allow updating the same key multiple times within a transaction, not even deleting and then reinserting the same key: the global index entry is not removed until the transaction commits, so the reinsert triggers a uniqueness constraint violation. The one exception is that the same key may be deleted multiple times within a transaction, and we rely on this exception to implement transactional delta flushing: instead of a DELETE followed by an INSERT, we perform an UPSERT followed by a DELETE (see the sketch below). However, our experiments show that UPSERT becomes very slow as the table grows, making it unsuitable for real-time replication.
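A minimal sketch of the failure mode and the workaround using DuckDB's Python API. The table `t`, the values, and the exact exception raised are illustrative assumptions; the precise behavior depends on the DuckDB version.

```python
import duckdb

con = duckdb.connect()  # in-memory database, purely for illustration
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val VARCHAR)")
con.execute("INSERT INTO t VALUES (1, 'a'), (2, 'b')")

# DELETE + re-INSERT of the same key inside one transaction: the old index entry
# is still present, so the INSERT fails with a primary-key constraint violation.
con.execute("BEGIN TRANSACTION")
con.execute("DELETE FROM t WHERE id = 1")
try:
    con.execute("INSERT INTO t VALUES (1, 'a2')")
except duckdb.Error as e:
    print("constraint violation:", e)
con.execute("ROLLBACK")

# Workaround used for delta flushing: UPSERT the new row versions first,
# then DELETE only the keys that disappeared in the delta.
con.execute("BEGIN TRANSACTION")
con.execute("INSERT OR REPLACE INTO t VALUES (1, 'a2')")  # updated row
con.execute("DELETE FROM t WHERE id = 2")                 # removed row
con.execute("COMMIT")
print(con.execute("SELECT * FROM t ORDER BY id").fetchall())
```

Deleting the same key repeatedly is allowed, which is why the UPSERT-then-DELETE ordering goes through where DELETE-then-INSERT does not.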
Index maintenance overhead:
DuckDB's current index maintenance implementation introduces a significant performance bottleneck during transaction commits. To update the index, all columns of affected row groups are decompressed and loaded into memory (there’s even a FIXME in the source code for this). Ideally, only the indexed columns should be loaded. This inefficiency becomes particularly problematic when tables have a large number of columns or include long string columns. Decompressing such columns is extremely slow. For instance, a profiling run on macOS revealed that ~90% of the processing time was spent loading an unnecessary FSST-compressed string column.
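Below is an illustrative micro-benchmark (not taken from the original profiling run) of how one might measure this overhead: it times a delete-and-commit on a table that also carries a long string column, with and without a primary-key index. The row count, payload size, and file-backed storage (so the string column is actually stored compressed) are assumptions.

```python
import os
import tempfile
import time

import duckdb


def timed_delete(with_pk: bool) -> float:
    # Fresh on-disk database per run so compressed storage is used.
    path = os.path.join(tempfile.mkdtemp(), "bench.duckdb")
    con = duckdb.connect(path)
    pk = "PRIMARY KEY" if with_pk else ""
    con.execute(f"CREATE TABLE t (id INTEGER {pk}, payload VARCHAR)")
    con.execute(
        "INSERT INTO t SELECT i, repeat('x', 200) || i FROM range(1000000) AS r(i)"
    )
    con.execute("CHECKPOINT")  # flush rows to compressed on-disk row groups

    start = time.time()
    con.execute("BEGIN TRANSACTION")
    con.execute("DELETE FROM t WHERE id % 100 = 0")
    con.execute("COMMIT")
    return time.time() - start


print("delete with PK index:   ", timed_delete(True))
print("delete without PK index:", timed_delete(False))
```

Whether the gap is dominated by decompressing the `payload` column, as in the profiling run described above, depends on the DuckDB version and the commit path taken.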
To address these challenges, we have decided to disable index creation for replicated tables, which significantly reduces update overhead. A potential concern is slower row lookups without a primary key index, but this has not been an issue in practice: primary keys are typically integers, and joining against a small batch of rows is sufficiently fast (a sketch of this lookup pattern follows).
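A sketch of the index-free lookup pattern. The table and column names (`orders`, `delta_keys`) are illustrative; the point is that resolving a small batch of keys against a large table works as a plain join, no index required.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE orders (id INTEGER, status VARCHAR)")  # no PRIMARY KEY
con.execute("INSERT INTO orders SELECT i, 'new' FROM range(1000000) AS r(i)")

# A small batch of keys to resolve, e.g. the rows touched by a replication delta.
con.execute("CREATE TEMP TABLE delta_keys (id INTEGER)")
con.execute("INSERT INTO delta_keys VALUES (42), (4242), (424242)")

rows = con.execute("""
    SELECT o.*
    FROM orders o
    JOIN delta_keys d USING (id)
""").fetchall()
print(rows)
```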
We may re-enable indexes for replicated tables once these limitations are resolved.