
Performance Improvement Suggestions #4

@djeets

Description


SQLite write contention in /api/collect causes performance degradation under concurrent traffic

Labels: bug, performance, database, help wanted, good first issue



The /api/collect endpoint — the core analytics data collection path that every tracked pageview hits — executes 7-9 independent database operations per request without wrapping them in a transaction. Each operation independently acquires and releases the SQLite write lock.

Under concurrent traffic (100+ simultaneous visitors generating pageviews), this creates write lock thrashing where requests compete for the database lock on every individual statement, severely limiting throughput. Additionally, the lack of a transaction boundary means a crash or error mid-sequence can leave inconsistent data in the database (e.g., a session row without its corresponding page_view row).

A secondary issue: src/lib/db.js only configures 3 SQLite pragmas, leaving significant performance on the table with suboptimal defaults for synchronous, cache_size, mmap_size, and temp_store.


Current Behavior

1. Lock acquisition pattern in /api/collect (src/pages/api/collect.js)

Every single pageview request executes this sequence of independent database operations:

Request arrives at /api/collect
│
├── Step 1:  SELECT id FROM sites WHERE id = ?                          ← read
├── Step 2:  SELECT id, page_count FROM sessions WHERE id = ?           ← read
│
├── Step 3:  INSERT INTO sessions (22 columns) VALUES (...)             ← WRITE LOCK #1
│            OR
│            UPDATE sessions SET exit_page, last_activity, ...          ← WRITE LOCK #1
│
├── Step 4:  SELECT id FROM affiliates WHERE site_id = ? AND slug = ?   ← read (conditional)
├── Step 5:  SELECT id FROM affiliate_visits WHERE ...                  ← read (conditional)
├── Step 6:  INSERT INTO affiliate_visits (...)                         ← WRITE LOCK #2 (conditional)
│
├── Step 7:  INSERT INTO page_views (7 columns) VALUES (...)            ← WRITE LOCK #3
├── Step 8:  INSERT INTO daily_stats (...) ON CONFLICT DO UPDATE ...    ← WRITE LOCK #4
│
├── Step 9:  SELECT 1 FROM sessions WHERE ... (visitor today check)     ← read (new sessions only)
└── Step 10: UPDATE daily_stats SET sessions = ..., visitors = ...      ← WRITE LOCK #5 (new sessions only)

Each write lock is acquired and released independently (autocommit). Under concurrent load:

  • Request A acquires write lock for INSERT INTO sessions, releases it
  • Request B acquires write lock for INSERT INTO sessions, releases it
  • Request A acquires write lock for INSERT INTO page_views, releases it
  • Request B gets SQLITE_BUSY, waits up to 5 seconds (busy_timeout)
  • Request A acquires write lock for UPSERT daily_stats, releases it
  • ... and so on

This interleaving multiplies the total lock contention by 4-5x compared to batching all writes in a single lock acquisition.

2. Partial write risk

If the Node.js process crashes (OOM, SIGKILL, hardware fault) or an error is thrown after INSERT INTO sessions (Step 3) but before INSERT INTO page_views (Step 7), the database is left in an inconsistent state:

  • A session row exists with no corresponding page_view
  • daily_stats may not have been updated
  • Affiliate visits may be partially recorded

The application has no mechanism to detect or repair these orphaned records.

3. Missing SQLite pragmas in src/lib/db.js

Current configuration (3 pragmas):

db.pragma('journal_mode = WAL');     // Good — enables concurrent reads
db.pragma('foreign_keys = ON');       // Good — data integrity
db.pragma('busy_timeout = 5000');     // Good — 5s retry on busy

Missing pragmas that leave significant performance on the table:

| Pragma | Default | Optimal for this workload | Impact |
|--------|---------|---------------------------|--------|
| `synchronous` | `FULL` (fsync every commit) | `NORMAL` | 2-5x write speedup. Safe with WAL mode; the WAL file provides the durability guarantee. The only theoretical risk is data loss during an OS crash at the exact moment of a WAL checkpoint, which is recoverable via the nightly aggregate.js cron job. |
| `cache_size` | ~2MB (`-2000` pages) | `-64000` (64MB) | More DB pages stay in RAM. Reduces disk I/O for the 15-25 queries in overview.js and the indexed lookups in collect.js. |
| `mmap_size` | `0` (disabled) | `268435456` (256MB) | Memory-maps the DB file. The OS virtual memory system handles page access, which is faster than SQLite's read-through-cache for read-heavy workloads. Well supported on macOS and Linux. |
| `temp_store` | `DISK` | `MEMORY` | Temporary tables and indices (used internally by SQLite for ORDER BY, GROUP BY, DISTINCT, and subqueries) are stored in RAM instead of on disk. |

4. No dashboard query caching

All analytics dashboard endpoints (src/pages/api/analytics/[siteId]/*.js) execute full SQL queries on every request:

| Endpoint | Queries per request | Current caching |
|----------|---------------------|-----------------|
| overview.js | 15-25 queries | None |
| realtime.js | 1 query (with subquery) | None |
| conversions.js | 5 queries + N+1 enrichment loop | None |
| affiliates.js | 2 queries with subquery JOINs | None |
| affiliates/[affiliateId].js | 7 queries | None |
| visitor-journey.js | 4 queries | None |
| flow.js | 1-2 queries | 60s LRU cache (already implemented) |

The overview.js endpoint is the heaviest — it runs up to 25 SQL queries including COUNT(DISTINCT visitor_id), multi-table JOINs, and GROUP BY aggregations. Every dashboard page load triggers all of them, even if the data hasn't changed in the last minute.


Expected Behavior

  1. All database writes in /api/collect should execute inside a single transaction — one lock acquisition, atomic commit/rollback
  2. SQLite should be configured with optimal pragmas for an analytics write-heavy + dashboard read-heavy workload
  3. High-traffic scenarios (100+ concurrent pageviews/sec) should be handled via write batching to amortize transaction overhead
  4. Dashboard queries should be cached in memory with appropriate TTLs to avoid redundant SQL execution

Impact

Estimated throughput limits

| Scenario | Current | With all fixes |
|----------|---------|----------------|
| Sustained pageviews/sec | ~150-300 | ~2,000-5,000 |
| Peak concurrent users | ~500 | ~2,000-5,000 |
| Dashboard overview load time | ~50-200ms | ~1ms (cache hit) |
| Write latency per pageview | ~5-10ms (7-9 autocommits) | ~0ms (buffered, async flush) |

Who is affected

Any self-hosted deployment with more than a few hundred concurrent visitors across all tracked sites. The bottleneck is per-server (single SQLite file), not per-site.


Reproduction

Simulating lock contention

# Start the dev server
yarn dev

# In another terminal, send 50 concurrent pageviews using GNU parallel or xargs
# (requires a valid site_id in your database — replace 1 with your site ID)

seq 1 50 | xargs -P 50 -I {} curl -s -o /dev/null -w "req {} → %{http_code} in %{time_total}s\n" \
  -X POST http://localhost:3000/api/collect \
  -H 'Content-Type: application/json' \
  -d '{"site_id":1,"visitor_id":"load_test_{}","session_id":"load_sess_{}","type":"pageview","pathname":"/load-test-{}","url":"http://example.com/load-test-{}","hostname":"example.com"}'

What you'll observe:

  • First few requests complete in ~5-20ms
  • Later requests queue up, taking 100-500ms+ as they wait for the busy timeout
  • Under extreme load, some requests may hit the 5-second busy_timeout and fail with 500

Verifying inconsistent data (partial write)

# Start the dev server, then kill it mid-request:
# 1. Send a slow burst of pageviews
# 2. While they're processing, kill the Node.js process: kill -9 <pid>
# 3. Check the database:

sqlite3 data/analytics.db "
  SELECT s.id, s.page_count, COUNT(pv.id) as actual_page_views
  FROM sessions s
  LEFT JOIN page_views pv ON pv.session_id = s.id
  GROUP BY s.id
  HAVING s.page_count != actual_page_views;
"
# Any rows returned indicate inconsistent data from partial writes

Proposed Solution

A 4-phase optimization plan, each phase independently deployable and verifiable. No breaking changes, no new dependencies, no schema migrations.

Phase 1: Add performance pragmas to src/lib/db.js

Impact: 2-3x read/write speedup | Risk: Very Low | Files: 1

Add 4 pragmas after the existing 3 in getDb():

// Existing
db.pragma('journal_mode = WAL');
db.pragma('foreign_keys = ON');
db.pragma('busy_timeout = 5000');

// Add these
db.pragma('synchronous = NORMAL');      // Skip fsync per commit (safe with WAL)
db.pragma('cache_size = -64000');        // 64MB page cache (default ~2MB)
db.pragma('mmap_size = 268435456');      // Memory-map up to 256MB of DB file
db.pragma('temp_store = MEMORY');        // Temp tables/sorts in RAM

Why each pragma is safe:

  • synchronous = NORMAL: With WAL already enabled, NORMAL is the SQLite-recommended setting. The WAL file itself provides the durability guarantee. The only risk is data loss during an OS-level crash at the exact moment of a WAL checkpoint — recoverable via the nightly cron/aggregate.js recomputation.
  • cache_size = -64000: Memory-only change. Reverts to default on restart. 64MB is modest for a server process.
  • mmap_size = 268435456: Read-only optimization. If the DB file is smaller than 256MB, only the actual file size is mapped. No effect on writes.
  • temp_store = MEMORY: Affects only SQLite-internal temporary tables. No impact on data or schema.

Verification:

  1. yarn dev — app starts, no errors
  2. Dashboard loads, all data renders
  3. Send a pageview via curl — 200 response, data persists

Rollback: Remove the 4 lines. No data changes.


Phase 2: Wrap /api/collect in a single transaction

Impact: 4-5x reduced lock contention | Risk: Low | Files: 1

Wrap lines 60-171 of src/pages/api/collect.js in db.transaction():

// BEFORE — 7-9 independent autocommits
const existingSession = db.prepare('SELECT ...').get(data.session_id);
if (!existingSession) {
  db.prepare('INSERT INTO sessions ...').run(...);       // autocommit #1
} else {
  db.prepare('UPDATE sessions ...').run(...);            // autocommit #1
}
if (data.ref) { /* affiliate INSERTs */ }                // autocommit #2
db.prepare('INSERT INTO page_views ...').run(...);       // autocommit #3
db.prepare('INSERT INTO daily_stats ... ON CONFLICT').run(...);  // autocommit #4
db.prepare('UPDATE daily_stats ...').run(...);            // autocommit #5


// AFTER — single transaction, single lock acquisition
const processPageview = db.transaction(() => {
  const existingSession = db.prepare('SELECT ...').get(data.session_id);
  if (!existingSession) {
    db.prepare('INSERT INTO sessions ...').run(...);
  } else {
    db.prepare('UPDATE sessions ...').run(...);
  }
  if (data.ref) { /* affiliate INSERTs — same code */ }
  db.prepare('INSERT INTO page_views ...').run(...);
  db.prepare('INSERT INTO daily_stats ... ON CONFLICT').run(...);
  db.prepare('UPDATE daily_stats ...').run(...);
});
processPageview();

What stays OUTSIDE the transaction (unchanged):

  • Input validation → 400 response
  • Site lookup (SELECT id FROM sites) → 404 response
  • UA parsing, geo header extraction → pure CPU, no DB

What goes INSIDE the transaction:

  • Session INSERT or UPDATE
  • Affiliate tracking (SELECT + conditional INSERT)
  • Page view INSERT
  • Daily stats UPSERT + visitor counting UPDATE

Why this is safe:

  1. All code is synchronous: better-sqlite3 transactions require synchronous callbacks, and collect.js has zero async/await. A perfect fit.
  2. Same pattern already used in src/lib/gsc-sync.js (line 96) and src/lib/backup.js (line 46).
  3. Atomic rollback is strictly better — today a crash mid-sequence leaves orphaned rows. With a transaction, either all operations succeed or none do.
  4. SQL queries are identical — no logic changes, no query changes, just wrapping.

Verification:

  1. New session pageview → rows in sessions, page_views, daily_stats ✓
  2. Existing session pageview → exit_page updated, page_count incremented ✓
  3. Affiliate pageview → affiliate_visits row created ✓
  4. Invalid site_id → 404, no transaction started ✓
  5. Dashboard loads correctly ✓

Rollback: Revert collect.js to original. Identical behavior.


Phase 3: Add write batching/buffering to /api/collect

Impact: ~10x write throughput | Risk: Medium | Files: 1

Buffer incoming pageview requests in memory and flush them in a single batched transaction every 100ms:

Request → validate site → parse UA → push to writeBuffer → respond 202 Accepted
                                              │
                               Timer (100ms) or buffer full (500 items)
                                              │
                                         flushBuffer()
                                              │
                               db.transaction(() => {
                                 for (item of batch) processOnePageview(db, item)
                               })

Key design decisions:

| Decision | Rationale |
|----------|-----------|
| 200 → 202 Accepted | `sendBeacon()` is fire-and-forget and doesn't check status. The XHR fallback in public/t.js also doesn't check. The change is invisible to all clients. |
| Site validation before buffering | Invalid sites are rejected immediately with 404. No garbage enters the buffer. Cheap indexed lookup (primary key). |
| UA parsing before buffering | The buffer stores fully parsed, ready-to-insert data. Keeps the flush function simple. |
| `writeBuffer.splice(0)` for drain | Atomically removes and returns all items. Safe in single-threaded Node.js; no race between push and drain. |
| `MAX_BUFFER_SIZE = 500` | Force-flush if the buffer grows large. Prevents unbounded memory use under extreme bursts. |
| Graceful shutdown handlers | SIGTERM/SIGINT/beforeExit flush the buffer before process exit. Zero data loss during deployments. |
| Error: log and drop | Analytics is best-effort. If a batch transaction fails (disk full, corruption), retrying the same data would loop. Log the error with the batch size for monitoring. |
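The pipeline above can be sketched as a small self-contained module. This is a hypothetical helper (`createWriteBuffer` and its internals are not from the codebase); the injected `flushBatch` callback stands in for the real batched `db.transaction()` from Phase 2:

```javascript
// Sketch of the Phase 3 write buffer. FLUSH_MS and MAX_BUFFER_SIZE mirror the
// proposed values (100ms timer, 500-item cap). The flushBatch callback is
// injected so the buffering logic can be exercised without a database; the
// real version would run db.transaction(() => { for (const item of batch) ... })().
const FLUSH_MS = 100;
const MAX_BUFFER_SIZE = 500;

function createWriteBuffer(flushBatch) {
  const writeBuffer = [];
  let timer = null;

  function flush() {
    if (timer) {
      clearTimeout(timer);
      timer = null;
    }
    if (writeBuffer.length === 0) return;
    // splice(0) atomically removes and returns every buffered item;
    // safe in single-threaded Node.js, no race between push and drain.
    const batch = writeBuffer.splice(0);
    try {
      flushBatch(batch);
    } catch (err) {
      // Best-effort analytics: log and drop rather than retry-loop a bad batch.
      console.error(`flush failed, dropping ${batch.length} items`, err);
    }
  }

  function push(item) {
    writeBuffer.push(item);
    if (writeBuffer.length >= MAX_BUFFER_SIZE) return flush(); // force-flush
    if (!timer) timer = setTimeout(flush, FLUSH_MS);
  }

  // Graceful shutdown: flush pending items before the process exits.
  for (const event of ['SIGTERM', 'SIGINT', 'beforeExit']) {
    process.on(event, flush);
  }

  return { push, flush };
}

module.exports = { createWriteBuffer };
```

The handler would validate the site, parse the UA, call `push(parsedPageview)`, and immediately respond 202 Accepted.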

Verification:

  1. Single pageview → 202, appears in DB within 200ms ✓
  2. Burst of 10 rapid pageviews → all 202, all in DB ✓
  3. daily_stats.page_views incremented by 10 ✓
  4. Real tracking script on a page → works end-to-end ✓
  5. Kill dev server → no "items lost" error (graceful flush) ✓

Rollback: Revert to the Phase 2 version (transaction-only). The 202→200 status change is transparent to clients.


Phase 4: Add in-memory query cache for dashboard endpoints

Impact: ~100x dashboard speedup on cache hits | Risk: Low | Files: 7 (1 new + 6 modified)

Independent of Phase 3 — can be implemented after any earlier phase.

New file: src/lib/cache.js

A generic TTL cache with LRU eviction, following the same Map-based pattern already used in src/pages/api/analytics/[siteId]/flow.js (line 22):

import { createCache } from '@/lib/cache';
const cache = createCache({ ttlMs: 60000, maxEntries: 200 });

// In any endpoint handler, after auth check:
const cacheKey = cache.buildKey(siteId, req.query);
const cached = cache.get(cacheKey);
if (cached) return res.status(200).json(cached);

// ... existing query logic (unchanged) ...

cache.set(cacheKey, result);
res.status(200).json(result);

Cache configuration per endpoint:

| Endpoint | TTL | Max entries | Rationale |
|----------|-----|-------------|-----------|
| overview.js | 60s | 200 | Heaviest endpoint (15-25 queries). Dashboard data doesn't change per-second. |
| realtime.js | 10s | 50 | Must stay fresh. Still eliminates duplicate concurrent requests (multiple tabs). |
| conversions.js | 30s | 200 | Medium weight. Eliminates the N+1 enrichment loop on repeated loads. |
| affiliates.js | 60s | 100 | Aggregate data. Changes slowly. |
| affiliates/[affiliateId].js | 60s | 100 | Detail view with 7 queries. |
| visitor-journey.js | 120s | 100 | Historical data. Rarely changes. |
| flow.js | n/a | n/a | Already cached (60s LRU). Do not modify. |

Cache key design:

  • Format: {siteId}|{sorted query params}
  • buildKey(siteId, req.query) sorts query keys alphabetically so ?period=30d&tab=sources and ?tab=sources&period=30d hit the same cache entry
  • siteId is always part of the key (different sites never share cache)

Memory bounds:

  • Each cache entry is a JSON response object, typically 5-50KB
  • With maxEntries: 200, worst case per cache instance: ~10MB
  • 6 cache instances × 10MB = ~60MB total worst case
  • LRU eviction kicks in when maxEntries is reached
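A minimal sketch of what src/lib/cache.js could look like under these constraints. The `createCache`/`buildKey`/`get`/`set` surface matches the usage shown earlier; the Map-based TTL + LRU internals are an assumption modeled on the existing flow.js cache:

```javascript
// Sketch of a TTL cache with LRU eviction. A Map iterates in insertion
// order, so the first key is always the least-recently-used entry.
function createCache({ ttlMs = 60000, maxEntries = 200 } = {}) {
  const store = new Map();

  return {
    // Sort query keys so ?period=30d&tab=sources and ?tab=sources&period=30d
    // produce the same cache key. siteId is always part of the key.
    buildKey(siteId, query = {}) {
      const params = Object.keys(query)
        .sort()
        .map((k) => `${k}=${query[k]}`)
        .join('&');
      return `${siteId}|${params}`;
    },

    get(key) {
      const entry = store.get(key);
      if (!entry) return undefined;
      if (Date.now() - entry.createdAt > ttlMs) {
        store.delete(key); // expired
        return undefined;
      }
      // Refresh recency: re-insert so this key becomes the newest entry.
      store.delete(key);
      store.set(key, entry);
      return entry.value;
    },

    set(key, value) {
      if (store.has(key)) store.delete(key);
      store.set(key, { value, createdAt: Date.now() });
      // Evict least-recently-used entries once the cap is exceeded.
      while (store.size > maxEntries) {
        store.delete(store.keys().next().value);
      }
    },
  };
}

module.exports = { createCache };
```

Each endpoint creates its own instance with the TTL/maxEntries from the table above, so per-endpoint memory stays bounded independently.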

Verification:

  1. Hit overview twice → second request ~100x faster ✓
  2. Compare response bodies → identical JSON ✓
  3. Wait 60s, request again → fresh data (cache expired) ✓
  4. Different params → separate cache entries ✓
  5. flow.js unaffected (its own cache still works) ✓

Rollback: Remove 3-5 lines per endpoint. Delete cache.js.


Summary

| Phase | What | Files changed | New deps | Schema change | Breaking change | Throughput impact |
|-------|------|---------------|----------|---------------|-----------------|-------------------|
| 1 | Performance pragmas | 1 | 0 | None | None | 2-3x |
| 2 | Transaction wrapping | 1 | 0 | None | None | 3-5x |
| 3 | Write batching | 1 | 0 | None | 200→202 (invisible) | ~10x |
| 4 | Query caching | 7 (1 new) | 0 | None | None | ~100x reads |

Combined result: ~150-300 pageviews/sec → ~2,000-5,000 pageviews/sec. Dashboard queries from ~50-200ms → ~1ms on cache hit.

Phase dependencies

Phase 1 (pragmas) → Phase 2 (transaction) → Phase 3 (batching)
                                              │
Phase 4 (caching) ─── independent ────────────┘

Each phase is independently deployable and can be merged as a separate PR. Phase 4 has no dependency on Phase 3 — it can be implemented at any point.


Environment

  • Database: SQLite via better-sqlite3
  • Runtime: Node.js 20+
  • Framework: Next.js 16 (Pages Router)
  • Relevant files:
    • src/lib/db.js — database connection and pragma configuration
    • src/pages/api/collect.js — analytics data collection endpoint
    • src/pages/api/analytics/[siteId]/*.js — dashboard query endpoints
    • src/lib/gsc-sync.js line 96 — existing db.transaction() usage (reference)
    • src/lib/backup.js line 46 — existing db.transaction() usage (reference)
    • src/pages/api/analytics/[siteId]/flow.js line 22 — existing Map LRU cache (reference)
