
Performance Improvement Suggestions #4

@djeets

Description


SQLite write contention in /api/collect causes performance degradation under concurrent traffic

Labels: bug, performance, database, help wanted, good first issue



The /api/collect endpoint — the core analytics data collection path that every tracked pageview hits — executes 7-9 independent database operations per request without wrapping them in a transaction. Each operation independently acquires and releases the SQLite write lock.

Under concurrent traffic (100+ simultaneous visitors generating pageviews), this creates write lock thrashing where requests compete for the database lock on every individual statement, severely limiting throughput. Additionally, the lack of a transaction boundary means a crash or error mid-sequence can leave inconsistent data in the database (e.g., a session row without its corresponding page_view row).

A secondary issue: src/lib/db.js only configures 3 SQLite pragmas, leaving significant performance on the table with suboptimal defaults for synchronous, cache_size, mmap_size, and temp_store.


Current Behavior

1. Lock acquisition pattern in /api/collect (src/pages/api/collect.js)

Every single pageview request executes this sequence of independent database operations:

Request arrives at /api/collect
│
├── Step 1:  SELECT id FROM sites WHERE id = ?                          ← read
├── Step 2:  SELECT id, page_count FROM sessions WHERE id = ?           ← read
│
├── Step 3:  INSERT INTO sessions (22 columns) VALUES (...)             ← WRITE LOCK #1
│            OR
│            UPDATE sessions SET exit_page, last_activity, ...          ← WRITE LOCK #1
│
├── Step 4:  SELECT id FROM affiliates WHERE site_id = ? AND slug = ?   ← read (conditional)
├── Step 5:  SELECT id FROM affiliate_visits WHERE ...                  ← read (conditional)
├── Step 6:  INSERT INTO affiliate_visits (...)                         ← WRITE LOCK #2 (conditional)
│
├── Step 7:  INSERT INTO page_views (7 columns) VALUES (...)            ← WRITE LOCK #3
├── Step 8:  INSERT INTO daily_stats (...) ON CONFLICT DO UPDATE ...    ← WRITE LOCK #4
│
├── Step 9:  SELECT 1 FROM sessions WHERE ... (visitor today check)     ← read (new sessions only)
└── Step 10: UPDATE daily_stats SET sessions = ..., visitors = ...      ← WRITE LOCK #5 (new sessions only)

Each write lock is acquired and released independently (autocommit). Under concurrent load:

  • Request A acquires write lock for INSERT INTO sessions, releases it
  • Request B acquires write lock for INSERT INTO sessions, releases it
  • Request A acquires write lock for INSERT INTO page_views, releases it
  • Request B gets SQLITE_BUSY, waits up to 5 seconds (busy_timeout)
  • Request A acquires write lock for UPSERT daily_stats, releases it
  • ... and so on

This interleaving multiplies the total lock contention by 4-5x compared to batching all writes in a single lock acquisition.

2. Partial write risk

If the Node.js process crashes (OOM, SIGKILL, hardware fault) or an error is thrown after INSERT INTO sessions (Step 3) but before INSERT INTO page_views (Step 7), the database is left in an inconsistent state:

  • A session row exists with no corresponding page_view
  • daily_stats may not have been updated
  • Affiliate visits may be partially recorded

The application has no mechanism to detect or repair these orphaned records.

3. Missing SQLite pragmas in src/lib/db.js

Current configuration (3 pragmas):

db.pragma('journal_mode = WAL');     // Good — enables concurrent reads
db.pragma('foreign_keys = ON');       // Good — data integrity
db.pragma('busy_timeout = 5000');     // Good — 5s retry on busy

Missing pragmas that leave significant performance on the table:

| Pragma | Default | Optimal for this workload | Impact |
|--------|---------|---------------------------|--------|
| `synchronous` | `FULL` (fsync every commit) | `NORMAL` | 2-5x write speedup. Safe with WAL mode; the WAL file provides the durability guarantee. The only theoretical risk is data loss during an OS crash at the exact moment of a WAL checkpoint, which is recoverable via the nightly aggregate.js cron job. |
| `cache_size` | ~2MB (`-2000` pages) | `-64000` (64MB) | More DB pages stay in RAM. Reduces disk I/O for the 15-25 queries in overview.js and the indexed lookups in collect.js. |
| `mmap_size` | `0` (disabled) | `268435456` (256MB) | Memory-maps the DB file. The OS virtual memory system handles page access, which is faster than SQLite's read-through-cache for read-heavy workloads. Well supported on macOS and Linux. |
| `temp_store` | `DISK` | `MEMORY` | Temporary tables and indices (used internally by SQLite for ORDER BY, GROUP BY, DISTINCT, and subqueries) are stored in RAM instead of on disk. |

4. No dashboard query caching

All analytics dashboard endpoints (src/pages/api/analytics/[siteId]/*.js) execute full SQL queries on every request:

| Endpoint | Queries per request | Current caching |
|----------|---------------------|-----------------|
| overview.js | 15-25 queries | None |
| realtime.js | 1 query (with subquery) | None |
| conversions.js | 5 queries + N+1 enrichment loop | None |
| affiliates.js | 2 queries with subquery JOINs | None |
| affiliates/[affiliateId].js | 7 queries | None |
| visitor-journey.js | 4 queries | None |
| flow.js | 1-2 queries | 60s LRU cache (already implemented) |

The overview.js endpoint is the heaviest — it runs up to 25 SQL queries including COUNT(DISTINCT visitor_id), multi-table JOINs, and GROUP BY aggregations. Every dashboard page load triggers all of them, even if the data hasn't changed in the last minute.


Expected Behavior

  1. All database writes in /api/collect should execute inside a single transaction — one lock acquisition, atomic commit/rollback
  2. SQLite should be configured with optimal pragmas for an analytics write-heavy + dashboard read-heavy workload
  3. High-traffic scenarios (100+ concurrent pageviews/sec) should be handled via write batching to amortize transaction overhead
  4. Dashboard queries should be cached in memory with appropriate TTLs to avoid redundant SQL execution

Impact

Estimated throughput limits

| Scenario | Current | With all fixes |
|----------|---------|----------------|
| Sustained pageviews/sec | ~150-300 | ~2,000-5,000 |
| Peak concurrent users | ~500 | ~2,000-5,000 |
| Dashboard overview load time | ~50-200ms | ~1ms (cache hit) |
| Write latency per pageview | ~5-10ms (7-9 autocommits) | ~0ms (buffered, async flush) |

Who is affected

Any self-hosted deployment with more than a few hundred concurrent visitors across all tracked sites. The bottleneck is per-server (single SQLite file), not per-site.


Reproduction

Simulating lock contention

# Start the dev server
yarn dev

# In another terminal, send 50 concurrent pageviews using GNU parallel or xargs
# (requires a valid site_id in your database — replace 1 with your site ID)

seq 1 50 | xargs -P 50 -I {} curl -s -o /dev/null -w "req {} → %{http_code} in %{time_total}s\n" \
  -X POST http://localhost:3000/api/collect \
  -H 'Content-Type: application/json' \
  -d '{"site_id":1,"visitor_id":"load_test_{}","session_id":"load_sess_{}","type":"pageview","pathname":"/load-test-{}","url":"http://example.com/load-test-{}","hostname":"example.com"}'

What you'll observe:

  • First few requests complete in ~5-20ms
  • Later requests queue up, taking 100-500ms+ as they wait for the busy timeout
  • Under extreme load, some requests may hit the 5-second busy_timeout and fail with 500

Verifying inconsistent data (partial write)

# Start the dev server, then kill it mid-request:
# 1. Send a slow burst of pageviews
# 2. While they're processing, kill the Node.js process: kill -9 <pid>
# 3. Check the database:

sqlite3 data/analytics.db "
  SELECT s.id, s.page_count, COUNT(pv.id) as actual_page_views
  FROM sessions s
  LEFT JOIN page_views pv ON pv.session_id = s.id
  GROUP BY s.id
  HAVING s.page_count != actual_page_views;
"
# Any rows returned indicate inconsistent data from partial writes

Proposed Solution

A 4-phase optimization plan, each phase independently deployable and verifiable. No breaking changes, no new dependencies, no schema migrations.

Phase 1: Add performance pragmas to src/lib/db.js

Impact: 2-3x read/write speedup | Risk: Very Low | Files: 1

Add 4 pragmas after the existing 3 in getDb():

// Existing
db.pragma('journal_mode = WAL');
db.pragma('foreign_keys = ON');
db.pragma('busy_timeout = 5000');

// Add these
db.pragma('synchronous = NORMAL');      // Skip fsync per commit (safe with WAL)
db.pragma('cache_size = -64000');        // 64MB page cache (default ~2MB)
db.pragma('mmap_size = 268435456');      // Memory-map up to 256MB of DB file
db.pragma('temp_store = MEMORY');        // Temp tables/sorts in RAM

Why each pragma is safe:

  • synchronous = NORMAL: With WAL already enabled, NORMAL is the SQLite-recommended setting. The WAL file itself provides the durability guarantee. The only risk is data loss during an OS-level crash at the exact moment of a WAL checkpoint — recoverable via the nightly cron/aggregate.js recomputation.
  • cache_size = -64000: Memory-only change. Reverts to default on restart. 64MB is modest for a server process.
  • mmap_size = 268435456: Read-only optimization. If the DB file is smaller than 256MB, only the actual file size is mapped. No effect on writes.
  • temp_store = MEMORY: Affects only SQLite-internal temporary tables. No impact on data or schema.

Verification:

  1. yarn dev — app starts, no errors
  2. Dashboard loads, all data renders
  3. Send a pageview via curl — 200 response, data persists

Rollback: Remove the 4 lines. No data changes.


Phase 2: Wrap /api/collect in a single transaction

Impact: 4-5x reduced lock contention | Risk: Low | Files: 1

Wrap lines 60-171 of src/pages/api/collect.js in db.transaction():

// BEFORE — 7-9 independent autocommits
const existingSession = db.prepare('SELECT ...').get(data.session_id);
if (!existingSession) {
  db.prepare('INSERT INTO sessions ...').run(...);       // autocommit #1
} else {
  db.prepare('UPDATE sessions ...').run(...);            // autocommit #1
}
if (data.ref) { /* affiliate INSERTs */ }                // autocommit #2
db.prepare('INSERT INTO page_views ...').run(...);       // autocommit #3
db.prepare('INSERT INTO daily_stats ... ON CONFLICT').run(...);  // autocommit #4
db.prepare('UPDATE daily_stats ...').run(...);            // autocommit #5


// AFTER — single transaction, single lock acquisition
const processPageview = db.transaction(() => {
  const existingSession = db.prepare('SELECT ...').get(data.session_id);
  if (!existingSession) {
    db.prepare('INSERT INTO sessions ...').run(...);
  } else {
    db.prepare('UPDATE sessions ...').run(...);
  }
  if (data.ref) { /* affiliate INSERTs — same code */ }
  db.prepare('INSERT INTO page_views ...').run(...);
  db.prepare('INSERT INTO daily_stats ... ON CONFLICT').run(...);
  db.prepare('UPDATE daily_stats ...').run(...);
});
processPageview();

What stays OUTSIDE the transaction (unchanged):

  • Input validation → 400 response
  • Site lookup (SELECT id FROM sites) → 404 response
  • UA parsing, geo header extraction → pure CPU, no DB

What goes INSIDE the transaction:

  • Session INSERT or UPDATE
  • Affiliate tracking (SELECT + conditional INSERT)
  • Page view INSERT
  • Daily stats UPSERT + visitor counting UPDATE

Why this is safe:

  1. All code is synchronous: better-sqlite3 transactions require synchronous callbacks, and collect.js has zero async/await. A perfect fit.
  2. Same pattern already used in src/lib/gsc-sync.js (line 96) and src/lib/backup.js (line 46).
  3. Atomic rollback is strictly better — today a crash mid-sequence leaves orphaned rows. With a transaction, either all operations succeed or none do.
  4. SQL queries are identical — no logic changes, no query changes, just wrapping.

Verification:

  1. New session pageview → rows in sessions, page_views, daily_stats ✓
  2. Existing session pageview → exit_page updated, page_count incremented ✓
  3. Affiliate pageview → affiliate_visits row created ✓
  4. Invalid site_id → 404, no transaction started ✓
  5. Dashboard loads correctly ✓

Rollback: Revert collect.js to original. Identical behavior.


Phase 3: Add write batching/buffering to /api/collect

Impact: ~10x write throughput | Risk: Medium | Files: 1

Buffer incoming pageview requests in memory and flush them in a single batched transaction every 100ms:

Request → validate site → parse UA → push to writeBuffer → respond 202 Accepted
                                              │
                               Timer (100ms) or buffer full (500 items)
                                              │
                                         flushBuffer()
                                              │
                               db.transaction(() => {
                                 for (item of batch) processOnePageview(db, item)
                               })

Key design decisions:

| Decision | Rationale |
|----------|-----------|
| 200 → 202 Accepted | `sendBeacon()` is fire-and-forget and doesn't check status. The XHR fallback in public/t.js also doesn't check. The change is invisible to all clients. |
| Site validation before buffering | Invalid sites are rejected immediately with 404. No garbage enters the buffer. Cheap indexed lookup (primary key). |
| UA parsing before buffering | The buffer stores fully parsed, ready-to-insert data. Keeps the flush function simple. |
| `writeBuffer.splice(0)` for drain | Atomically removes and returns all items. Safe in single-threaded Node.js; no race between push and drain. |
| `MAX_BUFFER_SIZE = 500` | Force-flush if the buffer grows large. Prevents unbounded memory use under extreme bursts. |
| Graceful shutdown handlers | SIGTERM/SIGINT/beforeExit flush the buffer before process exit. Zero data loss during deployments. |
| Error: log and drop | Analytics is best-effort. If a batch transaction fails (disk full, corruption), retrying the same data would loop. Log the error with the batch size for monitoring. |
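The pipeline above can be sketched as a small self-contained module. This is a hypothetical helper (`createWriteBuffer` and its internals are not from the codebase); the injected `flushBatch` callback stands in for the real batched `db.transaction()` from Phase 2:

```javascript
// Sketch of the Phase 3 write buffer. FLUSH_MS and MAX_BUFFER_SIZE mirror the
// proposed values (100ms timer, 500-item cap). The flushBatch callback is
// injected so the buffering logic can be exercised without a database; the
// real version would run db.transaction(() => { for (const item of batch) ... })().
const FLUSH_MS = 100;
const MAX_BUFFER_SIZE = 500;

function createWriteBuffer(flushBatch) {
  const writeBuffer = [];
  let timer = null;

  function flush() {
    if (timer) {
      clearTimeout(timer);
      timer = null;
    }
    if (writeBuffer.length === 0) return;
    // splice(0) atomically removes and returns every buffered item;
    // safe in single-threaded Node.js, no race between push and drain.
    const batch = writeBuffer.splice(0);
    try {
      flushBatch(batch);
    } catch (err) {
      // Best-effort analytics: log and drop rather than retry-loop a bad batch.
      console.error(`flush failed, dropping ${batch.length} items`, err);
    }
  }

  function push(item) {
    writeBuffer.push(item);
    if (writeBuffer.length >= MAX_BUFFER_SIZE) return flush(); // force-flush
    if (!timer) timer = setTimeout(flush, FLUSH_MS);
  }

  // Graceful shutdown: flush pending items before the process exits.
  for (const event of ['SIGTERM', 'SIGINT', 'beforeExit']) {
    process.on(event, flush);
  }

  return { push, flush };
}

module.exports = { createWriteBuffer };
```

The handler would validate the site, parse the UA, call `push(parsedPageview)`, and immediately respond 202 Accepted.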

Verification:

  1. Single pageview → 202, appears in DB within 200ms ✓
  2. Burst of 10 rapid pageviews → all 202, all in DB ✓
  3. daily_stats.page_views incremented by 10 ✓
  4. Real tracking script on a page → works end-to-end ✓
  5. Kill dev server → no "items lost" error (graceful flush) ✓

Rollback: Revert to the Phase 2 version (transaction-only). The 202→200 status change is transparent to clients.


Phase 4: Add in-memory query cache for dashboard endpoints

Impact: ~100x dashboard speedup on cache hits | Risk: Low | Files: 7 (1 new + 6 modified)

Independent of Phase 3 — can be implemented after any earlier phase.

New file: src/lib/cache.js

A generic TTL cache with LRU eviction, following the same Map-based pattern already used in src/pages/api/analytics/[siteId]/flow.js (line 22):

import { createCache } from '@/lib/cache';
const cache = createCache({ ttlMs: 60000, maxEntries: 200 });

// In any endpoint handler, after auth check:
const cacheKey = cache.buildKey(siteId, req.query);
const cached = cache.get(cacheKey);
if (cached) return res.status(200).json(cached);

// ... existing query logic (unchanged) ...

cache.set(cacheKey, result);
res.status(200).json(result);

Cache configuration per endpoint:

| Endpoint | TTL | Max entries | Rationale |
|----------|-----|-------------|-----------|
| overview.js | 60s | 200 | Heaviest endpoint (15-25 queries). Dashboard data doesn't change per-second. |
| realtime.js | 10s | 50 | Must stay fresh. Still eliminates duplicate concurrent requests (multiple tabs). |
| conversions.js | 30s | 200 | Medium weight. Eliminates the N+1 enrichment loop on repeated loads. |
| affiliates.js | 60s | 100 | Aggregate data. Changes slowly. |
| affiliates/[affiliateId].js | 60s | 100 | Detail view with 7 queries. |
| visitor-journey.js | 120s | 100 | Historical data. Rarely changes. |
| flow.js | n/a | n/a | Already cached (60s LRU). Do not modify. |

Cache key design:

  • Format: {siteId}|{sorted query params}
  • buildKey(siteId, req.query) sorts query keys alphabetically so ?period=30d&tab=sources and ?tab=sources&period=30d hit the same cache entry
  • siteId is always part of the key (different sites never share cache)

Memory bounds:

  • Each cache entry is a JSON response object, typically 5-50KB
  • With maxEntries: 200, worst case per cache instance: ~10MB
  • 6 cache instances × 10MB = ~60MB total worst case
  • LRU eviction kicks in when maxEntries is reached
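A minimal sketch of what src/lib/cache.js could look like under these constraints. The `createCache`/`buildKey`/`get`/`set` surface matches the usage shown earlier; the Map-based TTL + LRU internals are an assumption modeled on the existing flow.js cache:

```javascript
// Sketch of a TTL cache with LRU eviction. A Map iterates in insertion
// order, so the first key is always the least-recently-used entry.
function createCache({ ttlMs = 60000, maxEntries = 200 } = {}) {
  const store = new Map();

  return {
    // Sort query keys so ?period=30d&tab=sources and ?tab=sources&period=30d
    // produce the same cache key. siteId is always part of the key.
    buildKey(siteId, query = {}) {
      const params = Object.keys(query)
        .sort()
        .map((k) => `${k}=${query[k]}`)
        .join('&');
      return `${siteId}|${params}`;
    },

    get(key) {
      const entry = store.get(key);
      if (!entry) return undefined;
      if (Date.now() - entry.createdAt > ttlMs) {
        store.delete(key); // expired
        return undefined;
      }
      // Refresh recency: re-insert so this key becomes the newest entry.
      store.delete(key);
      store.set(key, entry);
      return entry.value;
    },

    set(key, value) {
      if (store.has(key)) store.delete(key);
      store.set(key, { value, createdAt: Date.now() });
      // Evict least-recently-used entries once the cap is exceeded.
      while (store.size > maxEntries) {
        store.delete(store.keys().next().value);
      }
    },
  };
}

module.exports = { createCache };
```

Each endpoint creates its own instance with the TTL/maxEntries from the table above, so per-endpoint memory stays bounded independently.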

Verification:

  1. Hit overview twice → second request ~100x faster ✓
  2. Compare response bodies → identical JSON ✓
  3. Wait 60s, request again → fresh data (cache expired) ✓
  4. Different params → separate cache entries ✓
  5. flow.js unaffected (its own cache still works) ✓

Rollback: Remove 3-5 lines per endpoint. Delete cache.js.


Summary

| Phase | What | Files changed | New deps | Schema change | Breaking change | Throughput impact |
|-------|------|---------------|----------|---------------|-----------------|-------------------|
| 1 | Performance pragmas | 1 | 0 | None | None | 2-3x |
| 2 | Transaction wrapping | 1 | 0 | None | None | 3-5x |
| 3 | Write batching | 1 | 0 | None | 200→202 (invisible) | ~10x |
| 4 | Query caching | 7 (1 new) | 0 | None | None | ~100x reads |

Combined result: ~150-300 pageviews/sec → ~2,000-5,000 pageviews/sec. Dashboard queries from ~50-200ms → ~1ms on cache hit.

Phase dependencies

Phase 1 (pragmas) → Phase 2 (transaction) → Phase 3 (batching)
                                              │
Phase 4 (caching) ─── independent ────────────┘

Each phase is independently deployable and can be merged as a separate PR. Phase 4 has no dependency on Phase 3 — it can be implemented at any point.


Environment

  • Database: SQLite via better-sqlite3
  • Runtime: Node.js 20+
  • Framework: Next.js 16 (Pages Router)
  • Relevant files:
    • src/lib/db.js — database connection and pragma configuration
    • src/pages/api/collect.js — analytics data collection endpoint
    • src/pages/api/analytics/[siteId]/*.js — dashboard query endpoints
    • src/lib/gsc-sync.js line 96 — existing db.transaction() usage (reference)
    • src/lib/backup.js line 46 — existing db.transaction() usage (reference)
    • src/pages/api/analytics/[siteId]/flow.js line 22 — existing Map LRU cache (reference)
