SQLite write contention in /api/collect causes performance degradation under concurrent traffic
Labels: bug, performance, database, help wanted, good first issue
Description
The /api/collect endpoint — the core analytics data collection path that every tracked pageview hits — executes 7-9 independent database operations per request without wrapping them in a transaction. Each operation independently acquires and releases the SQLite write lock.
Under concurrent traffic (100+ simultaneous visitors generating pageviews), this creates write lock thrashing where requests compete for the database lock on every individual statement, severely limiting throughput. Additionally, the lack of a transaction boundary means a crash or error mid-sequence can leave inconsistent data in the database (e.g., a session row without its corresponding page_view row).
A secondary issue: src/lib/db.js only configures 3 SQLite pragmas, leaving significant performance on the table with suboptimal defaults for synchronous, cache_size, mmap_size, and temp_store.
Current Behavior
1. Lock acquisition pattern in /api/collect (src/pages/api/collect.js)
Every single pageview request executes this sequence of independent database operations:
```
Request arrives at /api/collect
│
├── Step 1:  SELECT id FROM sites WHERE id = ?                        ← read
├── Step 2:  SELECT id, page_count FROM sessions WHERE id = ?         ← read
│
├── Step 3:  INSERT INTO sessions (22 columns) VALUES (...)           ← WRITE LOCK #1
│            OR
│            UPDATE sessions SET exit_page, last_activity, ...        ← WRITE LOCK #1
│
├── Step 4:  SELECT id FROM affiliates WHERE site_id = ? AND slug = ? ← read (conditional)
├── Step 5:  SELECT id FROM affiliate_visits WHERE ...                ← read (conditional)
├── Step 6:  INSERT INTO affiliate_visits (...)                       ← WRITE LOCK #2 (conditional)
│
├── Step 7:  INSERT INTO page_views (7 columns) VALUES (...)          ← WRITE LOCK #3
├── Step 8:  INSERT INTO daily_stats (...) ON CONFLICT DO UPDATE ...  ← WRITE LOCK #4
│
├── Step 9:  SELECT 1 FROM sessions WHERE ... (visitor today check)   ← read (new sessions only)
└── Step 10: UPDATE daily_stats SET sessions = ..., visitors = ...    ← WRITE LOCK #5 (new sessions only)
```
Each write lock is acquired and released independently (autocommit). Under concurrent load:
- Request A acquires the write lock for `INSERT INTO sessions`, releases it
- Request B acquires the write lock for `INSERT INTO sessions`, releases it
- Request A acquires the write lock for `INSERT INTO page_views`, releases it
- Request B gets `SQLITE_BUSY`, waits up to 5 seconds (`busy_timeout`)
- Request A acquires the write lock for the `daily_stats` UPSERT, releases it
- ... and so on
This interleaving multiplies the total lock contention by 4-5x compared to batching all writes in a single lock acquisition.
2. Partial write risk
If the Node.js process crashes (OOM, SIGKILL, hardware fault) or an error is thrown after INSERT INTO sessions (Step 3) but before INSERT INTO page_views (Step 7), the database is left in an inconsistent state:
- A session row exists with no corresponding page_view row
- `daily_stats` may not have been updated
- Affiliate visits may be partially recorded
The application has no built-in way to detect or repair these orphaned records.
3. Missing SQLite pragmas in src/lib/db.js
Current configuration (3 pragmas):

```js
db.pragma('journal_mode = WAL');  // Good — enables concurrent reads
db.pragma('foreign_keys = ON');   // Good — data integrity
db.pragma('busy_timeout = 5000'); // Good — 5s retry on busy
```
Missing pragmas that leave significant performance on the table:
| Pragma | Default | Optimal for this workload | Impact |
|---|---|---|---|
| `synchronous` | `FULL` (fsync every commit) | `NORMAL` | 2-5x write speedup. Safe with WAL mode — the WAL file provides the durability guarantee. The only theoretical risk is data loss during an OS crash at the exact moment of a WAL checkpoint, which is recoverable via the nightly `aggregate.js` cron job. |
| `cache_size` | ~2MB (`-2000` pages) | `-64000` (64MB) | More DB pages stay in RAM. Reduces disk I/O for the 15-25 queries in `overview.js` and the indexed lookups in `collect.js`. |
| `mmap_size` | `0` (disabled) | `268435456` (256MB) | Memory-maps the DB file. The OS virtual memory system handles page access, which is faster than SQLite's read-through-cache for read-heavy workloads. Well supported on macOS and Linux. |
| `temp_store` | `DISK` | `MEMORY` | Temporary tables and indices (used internally by SQLite for ORDER BY, GROUP BY, DISTINCT, and subqueries) are stored in RAM instead of on disk. |
4. No dashboard query caching
All analytics dashboard endpoints (src/pages/api/analytics/[siteId]/*.js) execute full SQL queries on every request:
| Endpoint | Queries per request | Current caching |
|---|---|---|
| `overview.js` | 15-25 queries | None |
| `realtime.js` | 1 query (with subquery) | None |
| `conversions.js` | 5 queries + N+1 enrichment loop | None |
| `affiliates.js` | 2 queries with subquery JOINs | None |
| `affiliates/[affiliateId].js` | 7 queries | None |
| `visitor-journey.js` | 4 queries | None |
| `flow.js` | 1-2 queries | 60s LRU cache (already implemented) |
The overview.js endpoint is the heaviest — it runs up to 25 SQL queries including COUNT(DISTINCT visitor_id), multi-table JOINs, and GROUP BY aggregations. Every dashboard page load triggers all of them, even if the data hasn't changed in the last minute.
Expected Behavior
- All database writes in `/api/collect` should execute inside a single transaction — one lock acquisition, atomic commit/rollback
- SQLite should be configured with optimal pragmas for a write-heavy collection + read-heavy dashboard workload
- High-traffic scenarios (100+ concurrent pageviews/sec) should be handled via write batching to amortize transaction overhead
- Dashboard queries should be cached in memory with appropriate TTLs to avoid redundant SQL execution
Impact
Estimated throughput limits
| Scenario | Current | With all fixes |
|---|---|---|
| Sustained pageviews/sec | ~150-300 | ~2,000-5,000 |
| Peak concurrent users | ~500 | ~2,000-5,000 |
| Dashboard overview load time | ~50-200ms | ~1ms (cache hit) |
| Write latency per pageview | ~5-10ms (7-9 autocommits) | ~0ms (buffered, async flush) |
Who is affected
Any self-hosted deployment with more than a few hundred concurrent visitors across all tracked sites. The bottleneck is per-server (single SQLite file), not per-site.
Reproduction
Simulating lock contention
```bash
# Start the dev server
yarn dev

# In another terminal, send 50 concurrent pageviews using GNU parallel or xargs
# (requires a valid site_id in your database — replace 1 with your site ID)
seq 1 50 | xargs -P 50 -I {} curl -s -o /dev/null -w "req {} → %{http_code} in %{time_total}s\n" \
  -X POST http://localhost:3000/api/collect \
  -H 'Content-Type: application/json' \
  -d '{"site_id":1,"visitor_id":"load_test_{}","session_id":"load_sess_{}","type":"pageview","pathname":"/load-test-{}","url":"http://example.com/load-test-{}","hostname":"example.com"}'
```
What you'll observe:
- The first few requests complete in ~5-20ms
- Later requests queue up, taking 100-500ms+ as they wait on the busy timeout
- Under extreme load, some requests may hit the 5-second `busy_timeout` and fail with a 500
Verifying inconsistent data (partial write)
```bash
# Start the dev server, then kill it mid-request:
# 1. Send a slow burst of pageviews
# 2. While they're processing, kill the Node.js process: kill -9 <pid>
# 3. Check the database:
sqlite3 data/analytics.db "
  SELECT s.id, s.page_count, COUNT(pv.id) AS actual_page_views
  FROM sessions s
  LEFT JOIN page_views pv ON pv.session_id = s.id
  GROUP BY s.id
  HAVING s.page_count != actual_page_views;
"
# Any rows returned indicate inconsistent data from partial writes
```
Proposed Solution
A 4-phase optimization plan, each phase independently deployable and verifiable. No breaking changes, no new dependencies, no schema migrations.
Phase 1: Add performance pragmas to src/lib/db.js
Impact: 2-3x read/write speedup | Risk: Very Low | Files: 1
Add 4 pragmas after the existing 3 in getDb():
```js
// Existing
db.pragma('journal_mode = WAL');
db.pragma('foreign_keys = ON');
db.pragma('busy_timeout = 5000');

// Add these
db.pragma('synchronous = NORMAL');  // Skip fsync per commit (safe with WAL)
db.pragma('cache_size = -64000');   // 64MB page cache (default ~2MB)
db.pragma('mmap_size = 268435456'); // Memory-map up to 256MB of DB file
db.pragma('temp_store = MEMORY');   // Temp tables/sorts in RAM
```
Why each pragma is safe:
- `synchronous = NORMAL`: With WAL already enabled, `NORMAL` is the SQLite-recommended setting. The WAL file itself provides the durability guarantee. The only risk is data loss during an OS-level crash at the exact moment of a WAL checkpoint — recoverable via the nightly `cron/aggregate.js` recomputation.
- `cache_size = -64000`: Memory-only change. Reverts to the default on restart. 64MB is modest for a server process.
- `mmap_size = 268435456`: Read-path optimization. If the DB file is smaller than 256MB, only the actual file size is mapped. No effect on writes.
- `temp_store = MEMORY`: Affects only SQLite-internal temporary tables. No impact on data or schema.
Verification:
- `yarn dev` — app starts, no errors
- Dashboard loads, all data renders
- Send a pageview via curl — 200 response, data persists
Rollback: Remove the 4 lines. No data changes.
Phase 2: Wrap /api/collect in a single transaction
Impact: 4-5x reduced lock contention | Risk: Low | Files: 1
Wrap lines 60-171 of src/pages/api/collect.js in db.transaction():
```js
// BEFORE — 7-9 independent autocommits
const existingSession = db.prepare('SELECT ...').get(data.session_id);
if (!existingSession) {
  db.prepare('INSERT INTO sessions ...').run(...);   // autocommit #1
} else {
  db.prepare('UPDATE sessions ...').run(...);        // autocommit #1
}
if (data.ref) { /* affiliate INSERTs */ }            // autocommit #2
db.prepare('INSERT INTO page_views ...').run(...);   // autocommit #3
db.prepare('INSERT INTO daily_stats ... ON CONFLICT').run(...); // autocommit #4
db.prepare('UPDATE daily_stats ...').run(...);       // autocommit #5

// AFTER — single transaction, single lock acquisition
const processPageview = db.transaction(() => {
  const existingSession = db.prepare('SELECT ...').get(data.session_id);
  if (!existingSession) {
    db.prepare('INSERT INTO sessions ...').run(...);
  } else {
    db.prepare('UPDATE sessions ...').run(...);
  }
  if (data.ref) { /* affiliate INSERTs — same code */ }
  db.prepare('INSERT INTO page_views ...').run(...);
  db.prepare('INSERT INTO daily_stats ... ON CONFLICT').run(...);
  db.prepare('UPDATE daily_stats ...').run(...);
});
processPageview();
```
What stays OUTSIDE the transaction (unchanged):
- Input validation → `400` response
- Site lookup (`SELECT id FROM sites`) → `404` response
- UA parsing, geo header extraction → pure CPU, no DB
What goes INSIDE the transaction:
- Session INSERT or UPDATE
- Affiliate tracking (SELECT + conditional INSERT)
- Page view INSERT
- Daily stats UPSERT + visitor counting UPDATE
Why this is safe:
- All code is synchronous — `better-sqlite3` transactions require synchronous callbacks, and collect.js has zero async/await. Perfect fit.
- The same pattern is already used in `src/lib/gsc-sync.js` (line 96) and `src/lib/backup.js` (line 46).
- Atomic rollback is strictly better — today a crash mid-sequence leaves orphaned rows. With a transaction, either all operations succeed or none do.
- The SQL queries are identical — no logic changes, no query changes, just wrapping.
Verification:
- New session pageview → rows in `sessions`, `page_views`, `daily_stats` ✓
- Existing session pageview → `exit_page` updated, `page_count` incremented ✓
- Affiliate pageview → `affiliate_visits` row created ✓
- Invalid `site_id` → 404, no transaction started ✓
- Dashboard loads correctly ✓
Rollback: Revert collect.js to original. Identical behavior.
Phase 3: Add write batching/buffering to /api/collect
Impact: ~10x write throughput | Risk: Medium | Files: 1
Buffer incoming pageview requests in memory and flush them in a single batched transaction every 100ms:
```
Request → validate site → parse UA → push to writeBuffer → respond 202 Accepted
                                            │
                          Timer (100ms) or buffer full (500 items)
                                            │
                                      flushBuffer()
                                            │
                              db.transaction(() => {
                                for (item of batch) processOnePageview(db, item)
                              })
```
Key design decisions:
| Decision | Rationale |
|---|---|
| 200 → 202 Accepted | `sendBeacon()` is fire-and-forget and doesn't check status. The XHR fallback in `public/t.js` also doesn't check. The change is invisible to all clients. |
| Site validation before buffering | Invalid sites are rejected immediately with 404. No garbage enters the buffer. Cheap indexed lookup (primary key). |
| UA parsing before buffering | The buffer stores fully parsed, ready-to-insert data. Keeps the flush function simple. |
| `writeBuffer.splice(0)` for drain | Atomically removes and returns all items. Safe in single-threaded Node.js — no race between push and drain. |
| `MAX_BUFFER_SIZE = 500` | Force-flush if the buffer grows large. Prevents unbounded memory under extreme burst. |
| Graceful shutdown handlers | `SIGTERM`/`SIGINT`/`beforeExit` flush the buffer before process exit. Zero data loss during deployments. |
| Error: log and drop | Analytics is best-effort. If a batch transaction fails (disk full, corruption), retrying the same data would loop. Log the error with batch size for monitoring. |
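The drain and flush mechanics above can be sketched independently of the database layer. All names here (`createWriteBuffer`, `persistBatch`, the constants) are illustrative, not existing code; in the real endpoint `persistBatch` would be the `db.transaction()` flush over the per-pageview insert logic:

```javascript
// Minimal sketch of the proposed buffering layer (names are hypothetical).
const FLUSH_INTERVAL_MS = 100;
const MAX_BUFFER_SIZE = 500;

function createWriteBuffer(persistBatch) {
  const writeBuffer = [];
  let timer = null;

  function flushBuffer() {
    if (writeBuffer.length === 0) return 0;
    // splice(0) removes and returns every buffered item in one step.
    // Node.js runs this synchronously, so no push() can interleave
    // between the length check and the drain.
    const batch = writeBuffer.splice(0);
    try {
      persistBatch(batch); // real code: db.transaction(...)(batch)
    } catch (err) {
      // Best-effort analytics: log and drop rather than retry-loop.
      console.error(`flush failed, dropping ${batch.length} items:`, err);
    }
    return batch.length;
  }

  function push(item) {
    writeBuffer.push(item);
    if (writeBuffer.length >= MAX_BUFFER_SIZE) {
      flushBuffer(); // force-flush under burst, bounding memory
    } else if (!timer) {
      timer = setTimeout(() => { timer = null; flushBuffer(); }, FLUSH_INTERVAL_MS);
      timer.unref?.(); // don't let the pending timer keep the process alive
    }
  }

  // Graceful shutdown: flush whatever is buffered before exit.
  // (The real implementation would also hook SIGTERM/SIGINT.)
  process.once('beforeExit', flushBuffer);

  return { push, flushBuffer, size: () => writeBuffer.length };
}
```

A handler would then push the parsed event and immediately respond `202`, leaving persistence to the timer or the force-flush threshold.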
Verification:
- Single pageview → 202, appears in DB within 200ms ✓
- Burst of 10 rapid pageviews → all 202, all in DB, `daily_stats.page_views` incremented by 10 ✓
- Real tracking script on a page → works end-to-end ✓
- Kill dev server → no "items lost" error (graceful flush) ✓
Rollback: Replace with Phase 2 version (transaction-only). 202→200 transparent.
Phase 4: Add in-memory query cache for dashboard endpoints
Impact: ~100x dashboard speedup on cache hits | Risk: Low | Files: 7 (1 new + 6 modified)
Independent of Phase 3 — can be implemented after any earlier phase.
New file: src/lib/cache.js
A generic TTL cache with LRU eviction, following the same Map-based pattern already used in src/pages/api/analytics/[siteId]/flow.js (line 22):
```js
import { createCache } from '@/lib/cache';

const cache = createCache({ ttlMs: 60000, maxEntries: 200 });

// In any endpoint handler, after the auth check:
const cacheKey = cache.buildKey(siteId, req.query);
const cached = cache.get(cacheKey);
if (cached) return res.status(200).json(cached);

// ... existing query logic (unchanged) ...
cache.set(cacheKey, result);
res.status(200).json(result);
```
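A minimal sketch of what `createCache` could look like — TTL expiry plus LRU eviction via `Map` insertion order. The option and method names mirror the usage above, but the eviction and key-normalization details here are assumptions, not the final implementation:

```javascript
// Hypothetical sketch of src/lib/cache.js (not existing code).
function createCache({ ttlMs = 60000, maxEntries = 200 } = {}) {
  const store = new Map(); // key -> { value, expiresAt }

  function get(key) {
    const entry = store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      store.delete(key); // expired — treat as a miss
      return undefined;
    }
    // Refresh recency: re-insert so Map iteration order tracks LRU.
    store.delete(key);
    store.set(key, entry);
    return entry.value;
  }

  function set(key, value) {
    if (store.has(key)) store.delete(key);
    store.set(key, { value, expiresAt: Date.now() + ttlMs });
    if (store.size > maxEntries) {
      // Evict the least-recently-used entry: first in insertion order.
      store.delete(store.keys().next().value);
    }
  }

  function buildKey(siteId, query = {}) {
    // Sort params so ?period=30d&tab=sources and ?tab=sources&period=30d
    // normalize to the same key; the siteId prefix keeps sites isolated.
    const params = Object.keys(query)
      .sort()
      .map((k) => `${k}=${query[k]}`)
      .join('&');
    return `${siteId}|${params}`;
  }

  return { get, set, buildKey };
}
```

Building on a plain `Map` keeps the module dependency-free, matching the pattern already in `flow.js`.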
Cache configuration per endpoint:
| Endpoint | TTL | Max Entries | Rationale |
|---|---|---|---|
| `overview.js` | 60s | 200 | Heaviest endpoint (15-25 queries). Dashboard data doesn't change per-second. |
| `realtime.js` | 10s | 50 | Must stay fresh. Still eliminates duplicate concurrent requests (multiple tabs). |
| `conversions.js` | 30s | 200 | Medium weight. Eliminates the N+1 enrichment loop on repeated loads. |
| `affiliates.js` | 60s | 100 | Aggregate data. Changes slowly. |
| `affiliates/[affiliateId].js` | 60s | 100 | Detail view with 7 queries. |
| `visitor-journey.js` | 120s | 100 | Historical data. Rarely changes. |
| `flow.js` | — | — | Already cached (60s LRU). Do not modify. |
Cache key design:
- Format: `{siteId}|{sorted query params}`
- `buildKey(siteId, req.query)` sorts query keys alphabetically so `?period=30d&tab=sources` and `?tab=sources&period=30d` hit the same cache entry
- `siteId` is always part of the key (different sites never share cache entries)
Memory bounds:
- Each cache entry is a JSON response object, typically 5-50KB
- With `maxEntries: 200`, worst case per cache instance: ~10MB
- 6 cache instances × 10MB = ~60MB total worst case
- LRU eviction kicks in when `maxEntries` is reached
Verification:
- Hit overview twice → second request ~100x faster ✓
- Compare response bodies → identical JSON ✓
- Wait 60s, request again → fresh data (cache expired) ✓
- Different params → separate cache entries ✓
- `flow.js` unaffected (its own cache still works) ✓
Rollback: Remove 3-5 lines per endpoint. Delete cache.js.
Summary
| Phase | What | Files Changed | New Deps | Schema Change | Breaking Change | Throughput Impact |
|---|---|---|---|---|---|---|
| 1 | Performance pragmas | 1 | 0 | None | None | 2-3x |
| 2 | Transaction wrapping | 1 | 0 | None | None | 3-5x |
| 3 | Write batching | 1 | 0 | None | 200→202 (invisible) | ~10x |
| 4 | Query caching | 7 (1 new) | 0 | None | None | ~100x reads |
Combined result: ~150-300 pageviews/sec → ~2,000-5,000 pageviews/sec. Dashboard queries from ~50-200ms → ~1ms on cache hit.
Phase dependencies
```
Phase 1 (pragmas) → Phase 2 (transaction) → Phase 3 (batching)
                                                    │
Phase 4 (caching) ─── independent ──────────────────┘
```
Each phase is independently deployable and can be merged as a separate PR. Phase 4 has no dependency on Phase 3 — it can be implemented at any point.
Environment
- Database: SQLite via `better-sqlite3`
- Runtime: Node.js 20+
- Framework: Next.js 16 (Pages Router)
- Relevant files:
  - `src/lib/db.js` — database connection and pragma configuration
  - `src/pages/api/collect.js` — analytics data collection endpoint
  - `src/pages/api/analytics/[siteId]/*.js` — dashboard query endpoints
  - `src/lib/gsc-sync.js` line 96 — existing `db.transaction()` usage (reference)
  - `src/lib/backup.js` line 46 — existing `db.transaction()` usage (reference)
  - `src/pages/api/analytics/[siteId]/flow.js` line 22 — existing `Map` LRU cache (reference)