feat(cache): add distributed cache-fill coordinator foundation in fork#19
Root cause: Redis miss coalescing only worked inside a single process and setCachedJson still published large payloads through the URL path.

Changes:
- add generated cache-fill policy support and validation to the dataset registry generator
- move Redis publish commands to JSON body requests and add a dedicated single-command helper
- add the shared distributed singleflight runtime with mandatory recheck, token-safe unlock, and structured logs
- cover coordinator behavior and body-based Redis compatibility in tests

Validation:
- npm run registry:generate
- npm run registry:check
- node --test tests/redis-caching.test.mjs
- npm run typecheck
- npm run typecheck:api
- npm exec tsx -- --test tests/stock-backtest.test.mts tests/stock-analysis-history.test.mts

Refs #15
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 639563c56e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```ts
const existing = inflight.get(key);
if (existing) {
  const shared = await existing;
  return { data: shared.data as T | null, source: 'fresh' };
```
Propagate actual source for coalesced cache-fill joiners
When cache-fill coordination is enabled, the in-flight promise can now resolve with source: 'cache' (for example, leader lock acquisition rechecks Redis and finds a value from another instance), but this joiner path always rewrites the source to 'fresh'. That mislabels concurrent callers as fresh even when no upstream fetch ran, which can incorrectly trigger fresh-only behavior in downstream handlers (such as metadata updates or provider attribution). Return shared.source here instead of hardcoding 'fresh'.
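The suggested fix is small; here is a minimal sketch of the joiner path with the source propagated instead of hardcoded. The `FillResult` shape, the `inflight` map, and `joinInflight` are illustrative assumptions, not the repo's actual names:

```typescript
// Hedged sketch: shapes are assumed, not taken from server/_shared/redis.ts.
type CacheSource = 'fresh' | 'cache';

interface FillResult<T> {
  data: T | null;
  source: CacheSource;
}

// Local in-process map of in-flight cache fills, keyed by cache key.
const inflight = new Map<string, Promise<FillResult<unknown>>>();

async function joinInflight<T>(key: string): Promise<FillResult<T> | null> {
  const existing = inflight.get(key);
  if (!existing) return null;
  const shared = await existing;
  // Before the fix: source was rewritten to 'fresh' here, mislabeling
  // joiners even when the leader served a value found in Redis.
  // After the fix: propagate whatever the leader actually observed.
  return { data: shared.data as T | null, source: shared.source };
}
```

With this change, a joiner that rides on a leader whose recheck found another instance's published value correctly reports `source: 'cache'` to downstream handlers.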
Greptile Summary

This PR lands the distributed cache-fill coordinator foundation.

Key points:
Confidence Score: 4/5

Safe to merge; coordinator is live for two low-risk keys with handler-level stale fallbacks, and the core locking logic is correct and well-tested. The implementation is solid: token-safe unlock, mandatory leader recheck, jitter polling, and graceful degradation on Redis errors are all present and covered by integration tests. Previous review concerns (EVAL key index, parseRedisCommand duplication) have been addressed. The only new findings are a stale timing entry in the docs and a minor startedAt accounting note — neither affects correctness. Score reflects one concrete doc fix remaining before full confidence.

docs/architecture/cache-fill-coordinator.md — timing values for serviceStatuses need updating to match the committed code.

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant CA as Instance A (Leader)
    participant CB as Instance B (Follower)
    participant R as Redis
    Note over CA,CB: Both instances observe a cache miss
    CA->>R: GET prefixKey(key) → null
    CB->>R: GET prefixKey(key) → null
    CA->>R: SET lockKey tokenA NX PX leaseMs → OK
    CB->>R: SET lockKey tokenB NX PX leaseMs → null (lock held)
    Note over CA: Leader: recheck before fetching
    CA->>R: GET prefixKey(key) → null (still miss)
    CA->>CA: run fetcher()
    CA->>R: SET prefixKey(key) JSON(result) EX ttl
    CA->>R: EVAL token-safe unlock → DEL lockKey
    Note over CB: Follower: poll with jitter until value or deadline
    loop poll (pollMinMs–pollMaxMs jitter)
        CB->>R: GET prefixKey(key) → null or value
    end
    CB->>CB: return observed value (source: cache)
    alt Follower times out (fallback=return_null)
        CB->>CB: return null → handler uses stale cache
    else Follower times out (fallback=hedge)
        CB->>R: SET lockKey tokenB2 NX PX leaseMs
        alt Hedge lock acquired
            CB->>R: GET prefixKey(key) recheck
            CB->>CB: run fetcher() if still miss
        else Hedge lock not acquired
            CB->>CB: return null
        end
    else Follower times out (fallback=throw)
        CB->>CB: throw cache-fill timeout error
    end
```
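The token-safe unlock in the diagram is the standard compare-then-delete pattern: the lock is only removed when the stored token still matches the caller's, so a leader whose lease expired cannot delete a successor's lock. The in-memory store below stands in for Redis to keep the sketch self-contained; with real Redis the check must run atomically, typically via an EVAL script like `if redis.call('GET', KEYS[1]) == ARGV[1] then return redis.call('DEL', KEYS[1]) else return 0 end`:

```typescript
// Illustrative in-memory model of the SET NX / token-safe DEL protocol.
// A real implementation issues these as Redis commands, not Map operations.
const store = new Map<string, string>();

// SET lockKey token NX semantics: succeed only if the key is absent.
function acquire(lockKey: string, token: string): boolean {
  if (store.has(lockKey)) return false;
  store.set(lockKey, token);
  return true;
}

// Token-safe unlock: delete only if we still own the lock.
// Prevents an expired leader from releasing a successor's lock.
function releaseIfOwner(lockKey: string, token: string): boolean {
  if (store.get(lockKey) !== token) return false;
  store.delete(lockKey);
  return true;
}
```

The two-step get-then-delete here is only safe because a Map is single-threaded; against Redis the same logic must be one EVAL round-trip, which is exactly what the `EVAL token-safe unlock` step in the diagram denotes.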
Prompt To Fix All With AI

This is a comment left during a code review.
Path: docs/architecture/cache-fill-coordinator.md
Line: 58-60
Comment:
**Timing values in docs don't match the committed code**
The documentation lists different timing values from what is actually committed in `registry/datasets.ts`:
| Field | Doc says | Code has |
|---|---|---|
| `serviceStatuses.leaseMs` | 12000 | 15000 |
| `serviceStatuses.waitMs` | 3000 | **11000** |
The `waitMs` discrepancy is material: followers will poll for up to **11 seconds** (not 3), meaning API requests to `list-service-statuses` that hit a cold miss while another instance is the leader could block for up to 11 seconds before falling back to the stale module cache. The 4× difference makes the doc actively misleading when triaging coordinator timeout logs.
```suggestion
- `serviceStatuses`: `leaseMs=15000`, `waitMs=11000`, `pollMinMs=75`, `pollMaxMs=175`, `fallback='return_null'`
- `riskScoresLive`: `leaseMs=15000`, `waitMs=4000`, `pollMinMs=100`, `pollMaxMs=250`, `fallback='return_null'`
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: server/_shared/redis.ts
Line: 404-421
Comment:
**`waitMs` budget already consumed by lock acquisition when follower starts polling**
`startedAt` is captured before the lock acquisition attempt (line 359). `tryAcquireFillLock` can take up to `REDIS_OP_TIMEOUT_MS` (1,500 ms) to resolve. The follower's effective poll window is therefore `waitMs − lockAcquisitionTime`, not `waitMs`.
For `riskScoresLive` (`waitMs: 4,000`): up to 2,500 ms of actual polling if the lock acquisition times out. Acceptable.
For `serviceStatuses` (`waitMs: 11,000`): up to 9,500 ms — still reasonable, but the 11-second stated ceiling is lower than it appears.
Consider documenting this accounting explicitly, or starting the deadline after the failed lock acquisition:
```ts
// Start the deadline only after confirming we are a follower,
// so the poll window isn't eroded by network round-trips.
const deadline = Date.now() + policy.waitMs;
```
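A sketch of the follower loop with the deadline started after the lock attempt, as the comment suggests. The `FillPolicy` fields mirror the policy names used in this PR (`waitMs`, `pollMinMs`, `pollMaxMs`); the function and helper names are illustrative, not the repo's actual API:

```typescript
// Assumed policy shape, mirroring the generated cache-fill registry fields.
interface FillPolicy {
  waitMs: number;
  pollMinMs: number;
  pollMaxMs: number;
}

// Uniform jitter in [pollMinMs, pollMaxMs] to desynchronize followers
// so they don't hammer Redis in lockstep.
function pollDelay(policy: FillPolicy): number {
  const span = policy.pollMaxMs - policy.pollMinMs;
  return policy.pollMinMs + Math.floor(Math.random() * (span + 1));
}

async function followerWait<T>(
  policy: FillPolicy,
  readCache: () => Promise<T | null>,
): Promise<T | null> {
  // Deadline starts here, after the failed lock acquisition has resolved,
  // so lock round-trip latency does not erode the poll window.
  const deadline = Date.now() + policy.waitMs;
  while (Date.now() < deadline) {
    const value = await readCache();
    if (value !== null) return value;
    await new Promise((resolve) => setTimeout(resolve, pollDelay(policy)));
  }
  return null; // fallback=return_null: caller falls back to stale cache
}
```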
How can I resolve this? If you propose a fix, please make it concise.

Reviews (5): Last reviewed commit: "fix(cache): align service status wait bu..."
The foundation PR introduced TypeScript extension imports in server/_shared/redis.ts, which broke Vercel's Edge bundling and diverged from the import pattern already used by the generated registry rollout. This change restores the repo-standard imports, keeps lock commands raw to avoid double-prefixing, propagates the true cache source for local joiners, and finishes the single-command Redis transport migration for deleteRedisKey. It also centralizes the Redis command parser used by the Redis-aware tests so the body-based transport and lock-release paths stay in sync across suites.
Addressed the actionable bot findings on this foundation slice. What changed:
Validation:
Root cause note: the Vercel failure was introduced in this PR, not by the earlier registry rollout. The registry phase already used extensionless generated/server imports.
Root cause: the coordinator foundation alone does not change runtime behavior until specific shared keys opt in through the generated registry.

Changes:
- add generated cache-fill policies for infra:service-statuses:v1 and risk:scores:sebuf:v1
- document the phase-two rollout and issue lineage in the architecture docs
- add handler regressions that lock the existing stale/local fallback behavior for both enabled keys
- widen the coordinator unit-test timing harness to remove cross-instance flake under full validation load

Validation:
- npm run registry:generate
- npm run registry:check
- node --test tests/redis-caching.test.mjs
- npm run typecheck
- npm run typecheck:api
- npm exec tsx -- --test tests/stock-backtest.test.mts tests/stock-analysis-history.test.mts
- npm run test:data

Known unrelated failures kept out of scope:
- npm run test:sidecar
- npm run test:e2e:runtime

Refs #16
Depends on #15
Inject a short-lived cache-fill registry into the handler fallback regressions so they do not inherit the 3s/4s production wait budgets. The rollout suite now exercises the same fallback paths in tens of milliseconds and widens the cross-instance coordinator window to reduce CI timing flake.
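One concrete way to read "short-lived registry": scale every timing budget down by a common factor before handing policies to the test harness. This is a hedged sketch of the idea only; the `FillPolicy` shape and `shrinkForTests` helper are illustrative assumptions, not the rollout suite's actual API:

```typescript
// Illustrative only: shows the timing-scaling idea behind injecting a
// short-lived cache-fill registry into handler fallback regressions.
interface FillPolicy {
  leaseMs: number;
  waitMs: number;
  pollMinMs: number;
  pollMaxMs: number;
}

// Scale every budget down so fallback paths that would wait 3-4 s in
// production resolve in tens of milliseconds, with a 1 ms floor so
// polling and leases never degenerate to zero.
function shrinkForTests(policy: FillPolicy, factor = 100): FillPolicy {
  const scale = (ms: number) => Math.max(1, Math.round(ms / factor));
  return {
    leaseMs: scale(policy.leaseMs),
    waitMs: scale(policy.waitMs),
    pollMinMs: scale(policy.pollMinMs),
    pollMaxMs: scale(policy.pollMaxMs),
  };
}
```

Scaling all fields by one factor preserves the relative ordering between lease, wait, and poll windows, so the fast registry exercises the same fallback paths as production timing would.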
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4491f59802
```ts
serviceStatuses: {
  enabled: true,
  leaseMs: 12_000,
  waitMs: 3_000,
```
Raise serviceStatuses waitMs above upstream timeout
The serviceStatuses cache-fill policy waits only 3 seconds (waitMs: 3_000), but the leader fetch path can legitimately run much longer (each status check uses UPSTREAM_TIMEOUT_MS = 10_000 and they are aggregated in Promise.all), so follower instances will often hit timeout and return null before the leader publishes. In listServiceStatuses, a null result falls back to fallbackStatusesCache, which is empty on cold start, so users can receive an empty status list even though fresh data is still being computed and will be cached shortly after. This wait window should be aligned with the handler’s worst-case fill time (or use a hedging fallback) to avoid serving false-empty responses.
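The budget math behind this finding can be made explicit. `UPSTREAM_TIMEOUT_MS` mirrors the constant named in the comment; the 1-second margin is an illustrative assumption covering serialization and publish overhead, not a value from the repo:

```typescript
// Per-check upstream timeout; checks are aggregated with Promise.all,
// so the leader's worst case is one timeout, not the sum across checks.
const UPSTREAM_TIMEOUT_MS = 10_000;

// A follower's wait window must outlast the leader's worst-case fill
// time, or cold-miss followers give up before the leader publishes.
function minFollowerWaitMs(upstreamTimeoutMs: number, marginMs = 1_000): number {
  return upstreamTimeoutMs + marginMs;
}

// waitMs: 3_000 gives up long before a slow leader can publish:
const tooShort = 3_000 < minFollowerWaitMs(UPSTREAM_TIMEOUT_MS); // true
```

Under these assumptions the minimum safe wait is 11 seconds, which matches the `waitMs=11000` value the follow-up commit lands for `infra:service-statuses:v1`.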
serviceStatuses followers were giving up before the leader could finish the 10s upstream fan-out and publish the shared result. Raise the coordinator wait/lease budget for infra:service-statuses:v1 so cross-instance cold misses do not fall back to an empty response while the leader is still computing.
Summary
This adds the fork-first distributed cache-fill coordinator foundation without enabling any production keys yet. The runtime now has a miss-only distributed singleflight primitive, body-based Redis publish commands, and generated cache-fill policy support, while the generated allowlist remains empty in this slice.
Root cause
`cachedFetchJson()` only coalesced cache misses inside a single process, so cross-instance cold misses could still stampede upstream fetches. At the same time, `setCachedJson()` still published JSON payloads through the URL path, which is the wrong transport for larger values and the wrong base for reliable leader publication.

Changes
- `DatasetContract.cacheFill` with `pollMinMs` and `pollMaxMs`
- `server/_shared/_generated/cache-fill-registry.ts` artifact
- `server/_shared/redis.ts` with mandatory recheck, follower polling, guarded hedge path, token-safe unlock, and structured cache-fill logs
Validation

- `npm run registry:generate`
- `npm run registry:check`
- `node --test tests/redis-caching.test.mjs`
- `npm run typecheck`
- `npm run typecheck:api`
- `npm exec tsx -- --test tests/stock-backtest.test.mts tests/stock-analysis-history.test.mts`

Risk
Low to moderate. This changes shared cache runtime code, but it is still disabled for real keys in this PR because the generated allowlist is empty. Redis coordination failures explicitly degrade to the existing local singleflight behavior.
Type of change
Affected areas
- (`/api/*`)

Checklist

- `api/rss-proxy.js` allowlist (if adding feeds)
- (`npm run typecheck`)
Not applicable.
Refs #15
Depends on #14