
feat(cache): add distributed cache-fill coordinator foundation in fork #19

Open
lspassos1 wants to merge 7 commits into `main` from `feat/cache-fill-coordinator-foundation`

Conversation

@lspassos1
Owner

Summary

This adds the fork-first distributed cache-fill coordinator foundation without enabling any production keys yet. The runtime now has a miss-only distributed singleflight primitive, body-based Redis publish commands, and generated cache-fill policy support, while the generated allowlist remains empty in this slice.

Root cause

cachedFetchJson() only coalesced cache misses inside a single process, so cross-instance cold misses could still stampede upstream fetches. At the same time, setCachedJson() still published JSON payloads through the URL path, which is the wrong transport for larger values and the wrong base for reliable leader publication.

Changes

  • extend DatasetContract.cacheFill with pollMinMs and pollMaxMs
  • add strict generator validation plus the new server/_shared/_generated/cache-fill-registry.ts artifact
  • move Redis single-command writes to JSON-body requests and keep pipeline normalization isolated
  • add the shared distributed coordinator in server/_shared/redis.ts with mandatory recheck, follower polling, guarded hedge path, token-safe unlock, and structured cache-fill logs
  • cover coordinator behavior and body-based Redis compatibility in the test suite
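The lock discipline named in the coordinator bullet above (token-safe unlock with mandatory recheck) follows the standard Redis pattern: acquire with `SET key token NX PX`, and release only if the stored token still matches. A minimal sketch using an in-memory stand-in for Redis; `acquireFillLock` and `releaseFillLock` are illustrative names, not the repo's actual API:

```typescript
type Store = Map<string, string>;

// SET key token NX semantics: only the first caller wins the lock.
function acquireFillLock(store: Store, lockKey: string, token: string): boolean {
  if (store.has(lockKey)) return false;
  store.set(lockKey, token);
  return true;
}

// Token-safe unlock: delete only if we still hold our own token,
// mirroring the classic GET/compare/DEL EVAL script.
function releaseFillLock(store: Store, lockKey: string, token: string): boolean {
  if (store.get(lockKey) !== token) return false;
  store.delete(lockKey);
  return true;
}
```

The token check is what makes the unlock safe: a leader whose lease expired cannot delete a lock that a newer leader now holds.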

Validation

  • npm run registry:generate
  • npm run registry:check
  • node --test tests/redis-caching.test.mjs
  • npm run typecheck
  • npm run typecheck:api
  • npm exec tsx -- --test tests/stock-backtest.test.mts tests/stock-analysis-history.test.mts

Risk

Low to moderate. This changes shared cache runtime code, but it is still disabled for real keys in this PR because the generated allowlist is empty. Redis coordination failures explicitly degrade to the existing local singleflight behavior.
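The degradation path described here (Redis coordination failure falls back to the existing local singleflight) can be sketched as a per-process in-flight map. This is an assumed shape for illustration, not the repo's exact code:

```typescript
// Per-process singleflight: concurrent misses for one key share a fetch.
const inflight = new Map<string, Promise<unknown>>();

async function singleflight<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
  const existing = inflight.get(key);
  if (existing) return existing as Promise<T>; // join the in-flight fetch
  const p = fetcher().finally(() => inflight.delete(key)); // clear slot when settled
  inflight.set(key, p);
  return p;
}
```

This only coalesces within one instance, which is exactly the gap the distributed coordinator closes; falling back to it on Redis errors keeps behavior no worse than before this PR.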

Type of change

  • Bug fix
  • New feature
  • New data source / feed
  • New map layer
  • Refactor / code cleanup
  • Documentation
  • CI / Build / Infrastructure

Affected areas

  • Map / Globe
  • News panels / RSS feeds
  • AI Insights / World Brief
  • Market Radar / Crypto
  • Desktop app (Tauri)
  • API endpoints (/api/*)
  • Config / Settings
  • Other: server cache runtime, generated registry artifacts, data tests

Checklist

  • Tested on worldmonitor.app variant
  • Tested on tech.worldmonitor.app variant (if applicable)
  • New RSS feed domains added to api/rss-proxy.js allowlist (if adding feeds)
  • No API keys or secrets committed
  • TypeScript compiles without errors (npm run typecheck)

Screenshots

Not applicable.

Refs #15
Depends on #14

Root cause: Redis miss coalescing only worked inside a single process and setCachedJson still published large payloads through the URL path.

Changes:
- add generated cache-fill policy support and validation to the dataset registry generator
- move Redis publish commands to JSON body requests and add a dedicated single-command helper
- add the shared distributed singleflight runtime with mandatory recheck, token-safe unlock, and structured logs
- cover coordinator behavior and body-based Redis compatibility in tests

Validation:
- npm run registry:generate
- npm run registry:check
- node --test tests/redis-caching.test.mjs
- npm run typecheck
- npm run typecheck:api
- npm exec tsx -- --test tests/stock-backtest.test.mts tests/stock-analysis-history.test.mts

Refs #15
@vercel

vercel bot commented Apr 12, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| worldmonitor | Ready | Preview, Comment | Apr 13, 2026 0:37am |


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 639563c56e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/_shared/redis.ts (Outdated)

```ts
const existing = inflight.get(key);
if (existing) {
  const shared = await existing;
  return { data: shared.data as T | null, source: 'fresh' };
```


**P2** Propagate actual source for coalesced cache-fill joiners

When cache-fill coordination is enabled, the in-flight promise can now resolve with source: 'cache' (for example, leader lock acquisition rechecks Redis and finds a value from another instance), but this joiner path always rewrites the source to 'fresh'. That mislabels concurrent callers as fresh even when no upstream fetch ran, which can incorrectly trigger fresh-only behavior in downstream handlers (such as metadata updates or provider attribution). Return shared.source here instead of hardcoding 'fresh'.

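The change this comment asks for is a one-line fix in the joiner path. A hedged sketch with illustrative types (`FillResult` and `joinInflight` are not the repo's names):

```typescript
type Source = 'fresh' | 'cache';
interface FillResult<T> { data: T | null; source: Source; }

// Join an in-flight fill and pass its resolved source through,
// instead of hardcoding 'fresh' as the flagged code did.
async function joinInflight<T>(
  existing: Promise<FillResult<unknown>>,
): Promise<FillResult<T>> {
  const shared = await existing;
  return { data: shared.data as T | null, source: shared.source };
}
```

With this shape, a joiner that rides a leader whose recheck found the value in Redis correctly reports `source: 'cache'`.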

@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Apr 12, 2026

Greptile Summary

This PR lands the distributed cache-fill coordinator foundation by extending DatasetContract.cacheFill with polling parameters, adding strict generator validation, migrating Redis single-command writes to JSON-body requests via the new runRedisCommand helper, and implementing the full coordinator in server/_shared/redis.ts (miss-only, allowlist-only, token-safe lock, mandatory leader recheck, follower polling with jitter, hedge fallback, and structured logging). It also consolidates the parseRedisCommand test helper into tests/helpers/fake-upstash-redis.mts, resolving the previously flagged duplication.

Key points:

  • Coordinator is live on merge. Both infra:service-statuses:v1 and risk:scores:sebuf:v1 are enabled: true in CACHE_FILL_POLICIES and emitted into the generated registry. This contradicts the PR description ("allowlist remains empty"), but the code and tests are consistent with one another — the description appears to be stale from an earlier draft.
  • Documentation timing values for serviceStatuses are incorrect. cache-fill-coordinator.md documents leaseMs=12000, waitMs=3000 but the committed code has leaseMs=15000, waitMs=11000. The 8-second waitMs discrepancy is material: followers block up to 11 seconds on a cold miss before falling back to the stale module cache.
  • startedAt is captured before the lock acquisition attempt, so the follower's effective poll window is waitMs − networkRoundTrip (up to 1,500 ms less). This is fine for riskScoresLive (waitMs: 4,000) and only a minor accounting concern for serviceStatuses (waitMs: 11,000), but worth documenting.
  • Token-safe unlock (EVAL script) and the normalizeRedisCommand EVAL key-index fix are both present and tested. The new pipeline EVAL test confirms that numkeys is used to locate keys rather than the default index 1.
  • Previously flagged issues: The parseRedisCommand helper is now shared (duplication resolved). The lock-key double-prefix note and the EVAL key-index warning from prior threads are acknowledged in the implementation (raw: true on all lock commands; the EVAL branch in normalizeRedisCommand now correctly derives key positions from numkeys).
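The EVAL key-index rule mentioned in the last bullet can be shown in isolation: for a command shaped `["EVAL", script, numkeys, ...keys, ...args]`, key positions are derived from `numkeys`, not from a fixed index. A sketch, not the repo's `normalizeRedisCommand`:

```typescript
// Locate the key segment of a Redis EVAL command array:
// ["EVAL", script, numkeys, key1..keyN, arg1..argM]
function evalKeys(command: string[]): string[] {
  const numkeys = Number(command[2]);
  return command.slice(3, 3 + numkeys);
}
```

A normalizer that assumed "the key is at index 1" would prefix the script body instead of the actual keys, which is the bug class the prior thread flagged.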

Confidence Score: 4/5

Safe to merge; coordinator is live for two low-risk keys with handler-level stale fallbacks, and the core locking logic is correct and well-tested.

The implementation is solid: token-safe unlock, mandatory leader recheck, jitter polling, and graceful degradation on Redis errors are all present and covered by integration tests. Previous review concerns (EVAL key index, parseRedisCommand duplication) have been addressed. The only new findings are a stale timing entry in the docs and a minor startedAt accounting note — neither affects correctness. Score reflects one concrete doc fix remaining before full confidence.

docs/architecture/cache-fill-coordinator.md — timing values for serviceStatuses need updating to match the committed code.

Important Files Changed

| Filename | Overview |
|---|---|
| `server/_shared/redis.ts` | Core change: adds `runRedisCommand` for body-based Redis calls, migrates `setCachedJson`/`deleteRedisKey` to it, and introduces the full distributed coordinator (lock acquire, mandatory recheck, follower polling, hedge, token-safe unlock). Logic is sound; minor concern about `startedAt` being captured before lock acquisition, which slightly shrinks the follower poll window. |
| `registry/datasets.ts` | Adds `pollMinMs`/`pollMaxMs` to `DatasetContract.cacheFill` and populates two enabled `CACHE_FILL_POLICIES` entries (`serviceStatuses`/`riskScoresLive`). Both are `enabled: true` and emitted into the generated registry, making the coordinator live on merge — consistent with the code but contradicting the PR description. |
| `docs/architecture/cache-fill-coordinator.md` | New architecture doc; the "Initial policy values" table at lines 58–60 lists stale timing values for `serviceStatuses` (`leaseMs=12000`, `waitMs=3000`) that don't match the committed code (`leaseMs=15000`, `waitMs=11000`). |
| `server/_shared/_generated/cache-fill-registry.ts` | Auto-generated artifact; two entries for `infra:service-statuses:v1` and `risk:scores:sebuf:v1` with correct structure. Timing values match `registry/datasets.ts` (source of truth) — the mismatch is only in the markdown docs. |
| `tests/helpers/fake-upstash-redis.mts` | Significant expansion: adds `parseRedisCommand` (now the shared helper both test files import), an `executeCommand` dispatcher, expiry-aware `readString`/`writeString`/`deleteKey`, and full NX/EX/PX/EVAL support. Resolves the previous duplication concern. |
| `tests/redis-caching.test.mjs` | Adds eight coordinator integration tests covering leader/follower coalescing, mandatory recheck, sentinel propagation, timeout fallbacks, hedge retry, token-safe unlock, invariant enforcement, and handler-level stale-cache fallbacks. Good coverage; `parseRedisCommand` is now imported from the shared helper. |
| `scripts/generate-dataset-registry.ts` | Adds strict validation for `pollMinMs`/`pollMaxMs` invariants and a new `bootstrap.tier` guard. Emits `cache-fill-registry.ts` correctly; validation logic matches the runtime invariants in `resolveCacheMiss`. |
| `scripts/check-dataset-registry.mjs` | Adds `cache-fill-registry.ts` to the `git diff --exit-code` staleness check, ensuring CI catches drift between source and generated artifact. |
| `tests/stock-analysis-history.test.mts` | Migrates Redis GET/SET parsing to the shared `parseRedisCommand` helper, replacing inline path-split logic. No behavior change; resolves the previously flagged duplication. |
| `tests/stock-backtest.test.mts` | Same import migration as `stock-analysis-history.test.mts` — adopts `parseRedisCommand` from the shared helper. No behavior change. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant CA as Instance A (Leader)
    participant CB as Instance B (Follower)
    participant R as Redis

    Note over CA,CB: Both instances observe a cache miss

    CA->>R: GET prefixKey(key) → null
    CB->>R: GET prefixKey(key) → null

    CA->>R: SET lockKey tokenA NX PX leaseMs → OK
    CB->>R: SET lockKey tokenB NX PX leaseMs → null (lock held)

    Note over CA: Leader: recheck before fetching
    CA->>R: GET prefixKey(key) → null (still miss)
    CA->>CA: run fetcher()
    CA->>R: SET prefixKey(key) JSON(result) EX ttl
    CA->>R: EVAL token-safe unlock → DEL lockKey

    Note over CB: Follower: poll with jitter until value or deadline
    loop poll (pollMinMs–pollMaxMs jitter)
        CB->>R: GET prefixKey(key) → null or value
    end
    CB->>CB: return observed value (source: cache)

    alt Follower times out (fallback=return_null)
        CB->>CB: return null → handler uses stale cache
    else Follower times out (fallback=hedge)
        CB->>R: SET lockKey tokenB2 NX PX leaseMs
        alt Hedge lock acquired
            CB->>R: GET prefixKey(key) recheck
            CB->>CB: run fetcher() if still miss
        else Hedge lock not acquired
            CB->>CB: return null
        end
    else Follower times out (fallback=throw)
        CB->>CB: throw cache-fill timeout error
    end
```
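The follower loop in the diagram above reduces to a jittered poll against the cache key until a value appears or the wait budget expires. A self-contained sketch; the `read` callback and the timing parameters are illustrative, not the repo's actual signatures:

```typescript
// Random delay between pollMinMs and pollMaxMs to de-synchronize followers.
function jitter(minMs: number, maxMs: number): number {
  return minMs + Math.random() * (maxMs - minMs);
}

async function pollForValue<T>(
  read: () => Promise<T | null>,
  waitMs: number,
  pollMinMs: number,
  pollMaxMs: number,
): Promise<T | null> {
  const deadline = Date.now() + waitMs;
  while (Date.now() < deadline) {
    const value = await read();
    if (value !== null) return value; // leader published the fill
    await new Promise((r) => setTimeout(r, jitter(pollMinMs, pollMaxMs)));
  }
  return null; // timed out; caller applies its fallback policy
}
```

The jitter keeps a fleet of followers from hammering Redis in lockstep; on timeout the coordinator applies the policy's fallback (`return_null`, hedge, or throw).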

This is a comment left during a code review.
Path: docs/architecture/cache-fill-coordinator.md
Line: 58-60

Comment:
**Timing values in docs don't match the committed code**

The documentation lists different timing values from what is actually committed in `registry/datasets.ts`:

| Field | Doc says | Code has |
|---|---|---|
| `serviceStatuses.leaseMs` | 12000 | 15000 |
| `serviceStatuses.waitMs` | 3000 | **11000** |

The `waitMs` discrepancy is material: followers will poll for up to **11 seconds** (not 3), meaning API requests to `list-service-statuses` that hit a cold miss while another instance is the leader could block for up to 11 seconds before falling back to the stale module cache. The 4× difference makes the doc actively misleading when triaging coordinator timeout logs.

```suggestion
- `serviceStatuses`: `leaseMs=15000`, `waitMs=11000`, `pollMinMs=75`, `pollMaxMs=175`, `fallback='return_null'`
- `riskScoresLive`: `leaseMs=15000`, `waitMs=4000`, `pollMinMs=100`, `pollMaxMs=250`, `fallback='return_null'`
```


---

This is a comment left during a code review.
Path: server/_shared/redis.ts
Line: 404-421

Comment:
**`waitMs` budget already consumed by lock acquisition when follower starts polling**

`startedAt` is captured before the lock acquisition attempt (line 359). `tryAcquireFillLock` can take up to `REDIS_OP_TIMEOUT_MS` (1,500 ms) to resolve. The follower's effective poll window is therefore `waitMs − lockAcquisitionTime`, not `waitMs`.

For `riskScoresLive` (`waitMs: 4,000`): up to 2,500 ms of actual polling if the lock acquisition times out. Acceptable.

For `serviceStatuses` (`waitMs: 11,000`): up to 9,500 ms — still reasonable, but the 11-second stated ceiling is lower than it appears.

Consider documenting this accounting explicitly, or starting the deadline after the failed lock acquisition:

```ts
// Start the deadline only after confirming we are a follower,
// so the poll window isn't eroded by network round-trips.
const deadline = Date.now() + policy.waitMs;
```



Comment thread server/_shared/redis.ts Outdated
Comment thread tests/stock-analysis-history.test.mts Outdated
Comment thread tests/helpers/fake-upstash-redis.mts Outdated
The foundation PR introduced TypeScript extension imports in server/_shared/redis.ts, which broke Vercel's Edge bundling and diverged from the import pattern already used by the generated registry rollout. This change restores the repo-standard imports, keeps lock commands raw to avoid double-prefixing, propagates the true cache source for local joiners, and finishes the single-command Redis transport migration for deleteRedisKey.

It also centralizes the Redis command parser used by the Redis-aware tests so the body-based transport and lock-release paths stay in sync across suites.
Owner Author

Addressed the actionable bot findings on this foundation slice.

What changed:

  • restored repo-standard extensionless imports in server/_shared/redis.ts so the Edge bundle no longer references ./_generated/cache-fill-registry.ts / ./hash.ts
  • lock acquire/release now run with raw: true, avoiding the double-prefixing Greptile called out
  • local in-flight joiners now preserve the shared source instead of forcing 'fresh'
  • deleteRedisKey() now uses the same body-based single-command transport as setCachedJson()
  • centralized the Redis command parser in tests/helpers/fake-upstash-redis.mts and reused it in the Redis-aware stock tests
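The double-prefix hazard behind the `raw: true` bullet is easy to show in isolation: if the command layer namespaces keys and the lock layer already builds a fully qualified key, the lock key gets prefixed twice unless the lock commands opt out. `PREFIX` and `resolveKey` below are illustrative, not the repo's code:

```typescript
const PREFIX = "wm:"; // assumed namespace, for illustration only

// Apply the namespace unless the caller marks the key as raw.
function resolveKey(key: string, raw: boolean): string {
  return raw ? key : `${PREFIX}${key}`;
}
```

A lock key built as `wm:lock:…` and then passed through the non-raw path would come out as `wm:wm:lock:…`, so the leader and follower would contend on different keys.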

Validation:

  • npm run registry:generate
  • npm run registry:check
  • node --test tests/redis-caching.test.mjs
  • npm run typecheck
  • npm run typecheck:api
  • npm exec tsx -- --test tests/stock-backtest.test.mts tests/stock-analysis-history.test.mts

Root cause note: the Vercel failure was introduced in this PR, not by the earlier registry rollout. The registry phase already used extensionless generated/server imports (server/_shared/cache-keys.ts -> ./_generated/bootstrap-registry) and did not trip the Edge deploy.

Comment thread server/_shared/redis.ts
Root cause: the coordinator foundation alone does not change runtime behavior until specific shared keys opt in through the generated registry.

Changes:
- add generated cache-fill policies for infra:service-statuses:v1 and risk:scores:sebuf:v1
- document the phase-two rollout and issue lineage in the architecture docs
- add handler regressions that lock the existing stale/local fallback behavior for both enabled keys
- widen the coordinator unit-test timing harness to remove cross-instance flake under full validation load

Validation:
- npm run registry:generate
- npm run registry:check
- node --test tests/redis-caching.test.mjs
- npm run typecheck
- npm run typecheck:api
- npm exec tsx -- --test tests/stock-backtest.test.mts tests/stock-analysis-history.test.mts
- npm run test:data

Known unrelated failures kept out of scope:
- npm run test:sidecar
- npm run test:e2e:runtime

Refs #16
Depends on #15
Inject a short-lived cache-fill registry into the handler fallback regressions so they do not inherit the 3s/4s production wait budgets. The rollout suite now exercises the same fallback paths in tens of milliseconds and widens the cross-instance coordinator window to reduce CI timing flake.
@lspassos1
Copy link
Copy Markdown
Owner Author

@codex review

Comment thread registry/datasets.ts

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4491f59802


Comment thread registry/datasets.ts (Outdated)

```ts
serviceStatuses: {
  enabled: true,
  leaseMs: 12_000,
  waitMs: 3_000,
```


**P2** Raise serviceStatuses waitMs above upstream timeout

The serviceStatuses cache-fill policy waits only 3 seconds (waitMs: 3_000), but the leader fetch path can legitimately run much longer (each status check uses UPSTREAM_TIMEOUT_MS = 10_000 and they are aggregated in Promise.all), so follower instances will often hit timeout and return null before the leader publishes. In listServiceStatuses, a null result falls back to fallbackStatusesCache, which is empty on cold start, so users can receive an empty status list even though fresh data is still being computed and will be cached shortly after. This wait window should be aligned with the handler’s worst-case fill time (or use a hedging fallback) to avoid serving false-empty responses.
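A quick budget check makes the bot's point concrete: the follower wait must cover the leader's worst-case fill time. Using the `UPSTREAM_TIMEOUT_MS = 10_000` quoted in the comment plus an assumed publish margin (`PUBLISH_SLACK_MS` is illustrative):

```typescript
const UPSTREAM_TIMEOUT_MS = 10_000; // per-check timeout, quoted in the comment above
const PUBLISH_SLACK_MS = 1_000;     // assumed margin for the leader's SET + follower read

// Promise.all is bounded by the slowest single check, so the leader's
// worst-case fill is one upstream timeout, not the sum of all checks.
function minFollowerWaitMs(): number {
  return UPSTREAM_TIMEOUT_MS + PUBLISH_SLACK_MS;
}
```

On that arithmetic the original `waitMs: 3_000` cannot cover a 10-second fan-out, which lines up with the wait budget the thread lands on later.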


serviceStatuses followers were giving up before the leader could finish the 10s upstream fan-out and publish the shared result.

Raise the coordinator wait/lease budget for infra:service-statuses:v1 so cross-instance cold misses do not fall back to an empty response while the leader is still computing.
