feat(cache): add distributed cache-fill coordinator foundation in fork#19
Root cause: Redis miss coalescing only worked inside a single process and setCachedJson still published large payloads through the URL path.

Changes:
- add generated cache-fill policy support and validation to the dataset registry generator
- move Redis publish commands to JSON body requests and add a dedicated single-command helper
- add the shared distributed singleflight runtime with mandatory recheck, token-safe unlock, and structured logs
- cover coordinator behavior and body-based Redis compatibility in tests

Validation:
- npm run registry:generate
- npm run registry:check
- node --test tests/redis-caching.test.mjs
- npm run typecheck
- npm run typecheck:api
- npm exec tsx -- --test tests/stock-backtest.test.mts tests/stock-analysis-history.test.mts

Refs #15
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 639563c56e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```ts
const existing = inflight.get(key);
if (existing) {
  const shared = await existing;
  return { data: shared.data as T | null, source: 'fresh' };
```
Propagate actual source for coalesced cache-fill joiners
When cache-fill coordination is enabled, the in-flight promise can now resolve with source: 'cache' (for example, leader lock acquisition rechecks Redis and finds a value from another instance), but this joiner path always rewrites the source to 'fresh'. That mislabels concurrent callers as fresh even when no upstream fetch ran, which can incorrectly trigger fresh-only behavior in downstream handlers (such as metadata updates or provider attribution). Return shared.source here instead of hardcoding 'fresh'.
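The suggested fix is small; here is a minimal sketch of the joiner path with the source propagated instead of hardcoded. The `FillResult` shape, the `inflight` map, and `joinInflight` are illustrative assumptions, not the repo's actual names:

```typescript
// Hedged sketch: shapes are assumed, not taken from server/_shared/redis.ts.
type CacheSource = 'fresh' | 'cache';

interface FillResult<T> {
  data: T | null;
  source: CacheSource;
}

// Local in-process map of in-flight cache fills, keyed by cache key.
const inflight = new Map<string, Promise<FillResult<unknown>>>();

async function joinInflight<T>(key: string): Promise<FillResult<T> | null> {
  const existing = inflight.get(key);
  if (!existing) return null;
  const shared = await existing;
  // Before the fix: source was rewritten to 'fresh' here, mislabeling
  // joiners even when the leader served a value found in Redis.
  // After the fix: propagate whatever the leader actually observed.
  return { data: shared.data as T | null, source: shared.source };
}
```

With this change, a joiner that rides on a leader whose recheck found another instance's published value correctly reports `source: 'cache'` to downstream handlers.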
Greptile Summary

This PR lands the distributed cache-fill coordinator foundation.

Key points:
Confidence Score: 4/5

Safe to merge; coordinator is live for two low-risk keys with handler-level stale fallbacks, and the core locking logic is correct and well-tested. The implementation is solid: token-safe unlock, mandatory leader recheck, jitter polling, and graceful degradation on Redis errors are all present and covered by integration tests. Previous review concerns (EVAL key index, parseRedisCommand duplication) have been addressed. The only new findings are a stale timing entry in the docs and a minor startedAt accounting note — neither affects correctness. Score reflects one concrete doc fix remaining before full confidence.

docs/architecture/cache-fill-coordinator.md — timing values for serviceStatuses need updating to match the committed code.

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant CA as Instance A (Leader)
    participant CB as Instance B (Follower)
    participant R as Redis
    Note over CA,CB: Both instances observe a cache miss
    CA->>R: GET prefixKey(key) → null
    CB->>R: GET prefixKey(key) → null
    CA->>R: SET lockKey tokenA NX PX leaseMs → OK
    CB->>R: SET lockKey tokenB NX PX leaseMs → null (lock held)
    Note over CA: Leader: recheck before fetching
    CA->>R: GET prefixKey(key) → null (still miss)
    CA->>CA: run fetcher()
    CA->>R: SET prefixKey(key) JSON(result) EX ttl
    CA->>R: EVAL token-safe unlock → DEL lockKey
    Note over CB: Follower: poll with jitter until value or deadline
    loop poll (pollMinMs–pollMaxMs jitter)
        CB->>R: GET prefixKey(key) → null or value
    end
    CB->>CB: return observed value (source: cache)
    alt Follower times out (fallback=return_null)
        CB->>CB: return null → handler uses stale cache
    else Follower times out (fallback=hedge)
        CB->>R: SET lockKey tokenB2 NX PX leaseMs
        alt Hedge lock acquired
            CB->>R: GET prefixKey(key) recheck
            CB->>CB: run fetcher() if still miss
        else Hedge lock not acquired
            CB->>CB: return null
        end
    else Follower times out (fallback=throw)
        CB->>CB: throw cache-fill timeout error
    end
```
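The token-safe unlock in the diagram is the standard compare-then-delete pattern: the lock is only removed when the stored token still matches the caller's, so a leader whose lease expired cannot delete a successor's lock. The in-memory store below stands in for Redis to keep the sketch self-contained; with real Redis the check must run atomically, typically via an EVAL script like `if redis.call('GET', KEYS[1]) == ARGV[1] then return redis.call('DEL', KEYS[1]) else return 0 end`:

```typescript
// Illustrative in-memory model of the SET NX / token-safe DEL protocol.
// A real implementation issues these as Redis commands, not Map operations.
const store = new Map<string, string>();

// SET lockKey token NX semantics: succeed only if the key is absent.
function acquire(lockKey: string, token: string): boolean {
  if (store.has(lockKey)) return false;
  store.set(lockKey, token);
  return true;
}

// Token-safe unlock: delete only if we still own the lock.
// Prevents an expired leader from releasing a successor's lock.
function releaseIfOwner(lockKey: string, token: string): boolean {
  if (store.get(lockKey) !== token) return false;
  store.delete(lockKey);
  return true;
}
```

The two-step get-then-delete here is only safe because a Map is single-threaded; against Redis the same logic must be one EVAL round-trip, which is exactly what the `EVAL token-safe unlock` step in the diagram denotes.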
Prompt To Fix All With AI

This is a comment left during a code review.
Path: docs/architecture/cache-fill-coordinator.md
Line: 58-60
Comment:
**Timing values in docs don't match the committed code**
The documentation lists different timing values from what is actually committed in `registry/datasets.ts`:
| Field | Doc says | Code has |
|---|---|---|
| `serviceStatuses.leaseMs` | 12000 | 15000 |
| `serviceStatuses.waitMs` | 3000 | **11000** |
The `waitMs` discrepancy is material: followers will poll for up to **11 seconds** (not 3), meaning API requests to `list-service-statuses` that hit a cold miss while another instance is the leader could block for up to 11 seconds before falling back to the stale module cache. The 4× difference makes the doc actively misleading when triaging coordinator timeout logs.
```suggestion
- `serviceStatuses`: `leaseMs=15000`, `waitMs=11000`, `pollMinMs=75`, `pollMaxMs=175`, `fallback='return_null'`
- `riskScoresLive`: `leaseMs=15000`, `waitMs=4000`, `pollMinMs=100`, `pollMaxMs=250`, `fallback='return_null'`
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: server/_shared/redis.ts
Line: 404-421
Comment:
**`waitMs` budget already consumed by lock acquisition when follower starts polling**
`startedAt` is captured before the lock acquisition attempt (line 359). `tryAcquireFillLock` can take up to `REDIS_OP_TIMEOUT_MS` (1,500 ms) to resolve. The follower's effective poll window is therefore `waitMs − lockAcquisitionTime`, not `waitMs`.
For `riskScoresLive` (`waitMs: 4,000`): up to 2,500 ms of actual polling if the lock acquisition times out. Acceptable.
For `serviceStatuses` (`waitMs: 11,000`): up to 9,500 ms — still reasonable, but the 11-second stated ceiling is lower than it appears.
Consider documenting this accounting explicitly, or starting the deadline after the failed lock acquisition:
```ts
// Start the deadline only after confirming we are a follower,
// so the poll window isn't eroded by network round-trips.
const deadline = Date.now() + policy.waitMs;
```
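A sketch of the follower loop with the deadline started after the lock attempt, as the comment suggests. The `FillPolicy` fields mirror the policy names used in this PR (`waitMs`, `pollMinMs`, `pollMaxMs`); the function and helper names are illustrative, not the repo's actual API:

```typescript
// Assumed policy shape, mirroring the generated cache-fill registry fields.
interface FillPolicy {
  waitMs: number;
  pollMinMs: number;
  pollMaxMs: number;
}

// Uniform jitter in [pollMinMs, pollMaxMs] to desynchronize followers
// so they don't hammer Redis in lockstep.
function pollDelay(policy: FillPolicy): number {
  const span = policy.pollMaxMs - policy.pollMinMs;
  return policy.pollMinMs + Math.floor(Math.random() * (span + 1));
}

async function followerWait<T>(
  policy: FillPolicy,
  readCache: () => Promise<T | null>,
): Promise<T | null> {
  // Deadline starts here, after the failed lock acquisition has resolved,
  // so lock round-trip latency does not erode the poll window.
  const deadline = Date.now() + policy.waitMs;
  while (Date.now() < deadline) {
    const value = await readCache();
    if (value !== null) return value;
    await new Promise((resolve) => setTimeout(resolve, pollDelay(policy)));
  }
  return null; // fallback=return_null: caller falls back to stale cache
}
```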
How can I resolve this? If you propose a fix, please make it concise.

Reviews (5): Last reviewed commit: "fix(cache): align service status wait bu..."
The foundation PR introduced TypeScript extension imports in server/_shared/redis.ts, which broke Vercel's Edge bundling and diverged from the import pattern already used by the generated registry rollout. This change restores the repo-standard imports, keeps lock commands raw to avoid double-prefixing, propagates the true cache source for local joiners, and finishes the single-command Redis transport migration for deleteRedisKey. It also centralizes the Redis command parser used by the Redis-aware tests so the body-based transport and lock-release paths stay in sync across suites.
Addressed the actionable bot findings on this foundation slice. What changed:
Validation:
Root cause note: the Vercel failure was introduced in this PR, not by the earlier registry rollout. The registry phase already used extensionless generated/server imports.
Root cause: the coordinator foundation alone does not change runtime behavior until specific shared keys opt in through the generated registry.

Changes:
- add generated cache-fill policies for infra:service-statuses:v1 and risk:scores:sebuf:v1
- document the phase-two rollout and issue lineage in the architecture docs
- add handler regressions that lock the existing stale/local fallback behavior for both enabled keys
- widen the coordinator unit-test timing harness to remove cross-instance flake under full validation load

Validation:
- npm run registry:generate
- npm run registry:check
- node --test tests/redis-caching.test.mjs
- npm run typecheck
- npm run typecheck:api
- npm exec tsx -- --test tests/stock-backtest.test.mts tests/stock-analysis-history.test.mts
- npm run test:data

Known unrelated failures kept out of scope:
- npm run test:sidecar
- npm run test:e2e:runtime

Refs #16
Depends on #15
Inject a short-lived cache-fill registry into the handler fallback regressions so they do not inherit the 3s/4s production wait budgets. The rollout suite now exercises the same fallback paths in tens of milliseconds and widens the cross-instance coordinator window to reduce CI timing flake.
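One concrete way to read "short-lived registry": scale every timing budget down by a common factor before handing policies to the test harness. This is a hedged sketch of the idea only; the `FillPolicy` shape and `shrinkForTests` helper are illustrative assumptions, not the rollout suite's actual API:

```typescript
// Illustrative only: shows the timing-scaling idea behind injecting a
// short-lived cache-fill registry into handler fallback regressions.
interface FillPolicy {
  leaseMs: number;
  waitMs: number;
  pollMinMs: number;
  pollMaxMs: number;
}

// Scale every budget down so fallback paths that would wait 3-4 s in
// production resolve in tens of milliseconds, with a 1 ms floor so
// polling and leases never degenerate to zero.
function shrinkForTests(policy: FillPolicy, factor = 100): FillPolicy {
  const scale = (ms: number) => Math.max(1, Math.round(ms / factor));
  return {
    leaseMs: scale(policy.leaseMs),
    waitMs: scale(policy.waitMs),
    pollMinMs: scale(policy.pollMinMs),
    pollMaxMs: scale(policy.pollMaxMs),
  };
}
```

Scaling all fields by one factor preserves the relative ordering between lease, wait, and poll windows, so the fast registry exercises the same fallback paths as production timing would.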
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4491f59802
```ts
serviceStatuses: {
  enabled: true,
  leaseMs: 12_000,
  waitMs: 3_000,
```
Raise serviceStatuses waitMs above upstream timeout
The serviceStatuses cache-fill policy waits only 3 seconds (waitMs: 3_000), but the leader fetch path can legitimately run much longer (each status check uses UPSTREAM_TIMEOUT_MS = 10_000 and they are aggregated in Promise.all), so follower instances will often hit timeout and return null before the leader publishes. In listServiceStatuses, a null result falls back to fallbackStatusesCache, which is empty on cold start, so users can receive an empty status list even though fresh data is still being computed and will be cached shortly after. This wait window should be aligned with the handler’s worst-case fill time (or use a hedging fallback) to avoid serving false-empty responses.
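The budget math behind this finding can be made explicit. `UPSTREAM_TIMEOUT_MS` mirrors the constant named in the comment; the 1-second margin is an illustrative assumption covering serialization and publish overhead, not a value from the repo:

```typescript
// Per-check upstream timeout; checks are aggregated with Promise.all,
// so the leader's worst case is one timeout, not the sum across checks.
const UPSTREAM_TIMEOUT_MS = 10_000;

// A follower's wait window must outlast the leader's worst-case fill
// time, or cold-miss followers give up before the leader publishes.
function minFollowerWaitMs(upstreamTimeoutMs: number, marginMs = 1_000): number {
  return upstreamTimeoutMs + marginMs;
}

// waitMs: 3_000 gives up long before a slow leader can publish:
const tooShort = 3_000 < minFollowerWaitMs(UPSTREAM_TIMEOUT_MS); // true
```

Under these assumptions the minimum safe wait is 11 seconds, which matches the `waitMs=11000` value the follow-up commit lands for `infra:service-statuses:v1`.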
serviceStatuses followers were giving up before the leader could finish the 10s upstream fan-out and publish the shared result. Raise the coordinator wait/lease budget for infra:service-statuses:v1 so cross-instance cold misses do not fall back to an empty response while the leader is still computing.
Summary
This adds the fork-first distributed cache-fill coordinator foundation without enabling any production keys yet. The runtime now has a miss-only distributed singleflight primitive, body-based Redis publish commands, and generated cache-fill policy support, while the generated allowlist remains empty in this slice.
Root cause
`cachedFetchJson()` only coalesced cache misses inside a single process, so cross-instance cold misses could still stampede upstream fetches. At the same time, `setCachedJson()` still published JSON payloads through the URL path, which is the wrong transport for larger values and the wrong base for reliable leader publication.

Changes
- `DatasetContract.cacheFill` with `pollMinMs` and `pollMaxMs`
- `server/_shared/_generated/cache-fill-registry.ts` artifact
- `server/_shared/redis.ts` with mandatory recheck, follower polling, guarded hedge path, token-safe unlock, and structured cache-fill logs
Validation

- `npm run registry:generate`
- `npm run registry:check`
- `node --test tests/redis-caching.test.mjs`
- `npm run typecheck`
- `npm run typecheck:api`
- `npm exec tsx -- --test tests/stock-backtest.test.mts tests/stock-analysis-history.test.mts`

Risk
Low to moderate. This changes shared cache runtime code, but it is still disabled for real keys in this PR because the generated allowlist is empty. Redis coordination failures explicitly degrade to the existing local singleflight behavior.
Type of change
Affected areas
- (`/api/*`)

Checklist

- `api/rss-proxy.js` allowlist (if adding feeds)
- (`npm run typecheck`)
Not applicable.
Refs #15
Depends on #14