Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable primary/secondary index creation for replicated tables #272

Open
fanyang01 opened this issue Dec 9, 2024 · 0 comments
Open

Disable primary/secondary index creation for replicated tables #272

fanyang01 opened this issue Dec 9, 2024 · 0 comments

Comments

@fanyang01
Copy link
Collaborator

fanyang01 commented Dec 9, 2024

Currently, DuckDB's indexes have the following limitations and issues:

  1. Over-eager uniqueness constraint checking:
    DuckDB does not allow updating the same key multiple times within a transaction, not even deleting and then reinserting the same key. This is because the global index entry is not removed during the transaction, leading to a uniqueness constraint violation error. Currently, we rely on an exception to this rule to implement transactional delta flushing: the same key can be deleted multiple times in a transaction. To work around this, we perform an UPSERT followed by a DELETE. However, our experiments show that UPSERT becomes very slow as the table size grows, making it unsuitable for real-time replication.

  2. Index maintenance overhead:
    DuckDB's current index maintenance implementation introduces a significant performance bottleneck during transaction commits. To update the index, all columns of affected row groups are decompressed and loaded into memory (there’s even a FIXME in the source code for this). Ideally, only the indexed columns should be loaded. This inefficiency becomes particularly problematic when tables have a large number of columns or include long string columns. Decompressing such columns is extremely slow. For instance, a profiling run on macOS revealed that ~90% of the processing time was spent loading an unnecessary FSST-compressed string column.

To address these challenges, we have decided to disable index creation for replicated tables. This change significantly reduces update overhead. A potential concern might be slower row lookups without a primary key index. However, this is not an issue in practice: primary keys are typically integers, and joining with a small number of rows is sufficiently fast.

We may re-enable indexes for replicated tables once these limitations are resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant