[WIP] Rewrite the "sync_local" query #56

Draft: wants to merge 2 commits into main
Conversation

@rkistner (Contributor) commented Feb 3, 2025

#40 fixed a performance issue in initial/bulk sync when there are many duplicate row_ids, but slightly decreased performance for the general case. This PR attempts to optimize it again, mostly by removing the second temp b-tree used in query execution.
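For reviewers who want to check this locally: temp b-tree usage shows up directly in SQLite's query plan output. This is a generic illustration against the test schema further down, not the actual sync_local query:

-- Hypothetical example: without a covering index, a plan like this typically
-- includes "USE TEMP B-TREE FOR GROUP BY" / "USE TEMP B-TREE FOR ORDER BY" rows.
EXPLAIN QUERY PLAN
SELECT row_type, row_id, max(op_id)
FROM ps_oplog
GROUP BY row_type, row_id
ORDER BY max(op_id) DESC;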

This does not make a massive difference in overall initial sync performance. On my machine, with 1M ops, the query time drops from around 5s to 3s, versus a total initial sync time of 60s. So it's not a big gain overall, but this is the slowest query that locks the database for writes and cannot be split into smaller subqueries, so any optimization here helps with app responsiveness.

There is another query form added in the comments, which can take the initial sync query time in the above case down to under 2s (with no temp b-tree at all), but it doesn't cater for incremental updates. It needs some stats tracking / heuristics to know when to use one query or the other, so I'm leaving that for later.

The temp b-trees used by this query could also be related to the RangeError: Maximum call stack size exceeded errors seen on iOS, as well as the disk I/O errors occasionally seen on Android, when SQLite is configured to use files for temporary storage. While those issues have other workarounds, any change that reduces temporary b-trees here could help.
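For background (not a change in this PR): whether those temporary b-trees go to files or memory is controlled by the temp_store pragma, which is the existing workaround alluded to above.

-- Keep SQLite temporary structures in memory instead of temp files.
PRAGMA temp_store = 2;  -- 0 = default (compile-time setting), 1 = file, 2 = memory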

TODO:

  • Regression testing
  • Test real-world performance

@rkistner (Contributor, Author) commented Feb 7, 2025

I did some performance benchmarks on web, using 1M rows with large ids, which forces SQLite to write the temp b-tree to disk.

Takeaways:

  1. The optimized query here makes a minor but still significant difference with native and OPFS.
  2. The optimized "sync from scratch" query gives a major performance boost, so it's worth investigating that further.
  3. OPFSCoopSyncVFS appears to scale similarly to native, just taking around 2x as long for any query.
  4. IDBBatchAtomicVFS scales much worse, and can take 10x as long as OPFSCoopSyncVFS with large datasets, when it works at all.
  5. With IDBBatchAtomicVFS, we cannot sync large datasets at all without PRAGMA temp_store = 2.
  6. With IDBBatchAtomicVFS, there is a lot of filesystem overhead. We can however use PRAGMA temp_store = 2 and PRAGMA cache_size = -200000 to give SQLite read performance similar to OPFS, since everything ends up in memory (see the sketch after this list).
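The pragma combination from takeaway 6, roughly as applied before the IDBBatchAtomicVFS runs below (the exact benchmark setup isn't included here, so treat this as a sketch):

PRAGMA temp_store = 2;        -- temp b-trees in memory rather than on the VFS
PRAGMA cache_size = -200000;  -- negative value = size in KiB, i.e. roughly 200 MB of page cache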

Linux, native:

  1. Original query: 2.01s
  2. Optimized query: 1.42s
  3. Optimized "sync from scratch" query: 0.254s

Web, IDBBatchAtomicVFS:

  1. Original query: fails with RangeError: offset is out of bounds (IDBBatchAtomicVFS.js:553) after around 10-15s.
  2. Optimized query: fails with RangeError: offset is out of bounds (IDBBatchAtomicVFS.js:553) after around 10-15s.
  3. Optimized "sync from scratch" query: 5.81s

Web, OPFSCoopSyncVFS:

  1. Original query: 4.38s
  2. Optimized query: 2.68s
  3. Optimized "sync from scratch" query: 0.502s

Web, IDBBatchAtomicVFS, PRAGMA temp_store = 2:

  1. Original query: 40s.
  2. Optimized query: 21s.
  3. Optimized "sync from scratch" query: 5.81s

Web, IDBBatchAtomicVFS, PRAGMA temp_store = 2, PRAGMA cache_size = -200000, pre-warmed cache:

  1. Original query: 4.43s.
  2. Optimized query: 3.30s.
  3. Optimized "sync from scratch" query: 0.727s

The test data:

BEGIN TRANSACTION;

-- 1M ops
WITH RECURSIVE generate_rows(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM generate_rows WHERE n < 1000000
)
INSERT INTO ps_oplog (bucket, op_id, row_type, row_id, key, data, hash)
SELECT 
    (n % 10), -- Generate 10 different buckets
    n,
    'thisisatable',
    'thisismyrowid' || n,
    'thisisarowkeykey_' || n,
    '{"n": ' || n || '}',
    (n * 17) % 1000000000 -- Some pseudo-random hash
    
FROM generate_rows;

-- 10 buckets
WITH RECURSIVE generate_rows(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM generate_rows WHERE n < 10
)
INSERT INTO ps_buckets (id, name)
SELECT 
    (n % 10),
    'bucket' || n
    
FROM generate_rows;

COMMIT;
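My timing harness isn't included above; for the native numbers, one simple way to reproduce them is the sqlite3 shell's built-in timer:

.timer on
SELECT count(*) FROM ps_oplog;  -- placeholder; substitute the sync_local query under test
-- the shell prints "Run Time: real ... user ... sys ..." after each statement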
