
ci: retry Railway preview service discovery#2800

Open
vnv-varun wants to merge 2 commits into main from varun/preview-railway-service-discovery

Conversation

@vnv-varun
Contributor

@vnv-varun vnv-varun commented Mar 23, 2026

Summary

Retry Railway environment and service ID discovery when provisioning per-PR preview infrastructure so preview setup does not fail on freshly cloned Railway environments.

Changes

  • add railway_wait_for_environment_id() in .github/scripts/preview/common.sh
  • add railway_wait_for_service_id_for_env() in .github/scripts/preview/common.sh
  • update railway_ensure_tcp_proxy() to wait for Railway to materialize cloned service instances before creating TCP proxies
  • preserve fail-fast behavior by propagating wait-function failures with || return 1
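The bodies of the two wait helpers aren't quoted in this thread; a minimal sketch of the shared retry pattern, assuming the existing single-shot lookup `railway_environment_id` from common.sh and using a plain `sleep` in place of the file's `sleep_with_jitter` helper:

```shell
# Sketch only: railway_environment_id is the pre-existing single-shot
# lookup; its signature here is an assumption based on this PR's description.
railway_wait_for_environment_id() {
  local project_id="$1" env_name="$2"
  local max_attempts="${3:-30}" sleep_seconds="${4:-2}"
  local attempt env_id
  for attempt in $(seq 1 "${max_attempts}"); do
    # Tolerate a failing lookup: treat it the same as an empty result.
    env_id="$(railway_environment_id "${project_id}" "${env_name}")" || env_id=""
    if [ -n "${env_id}" ]; then
      printf '%s\n' "${env_id}"
      return 0
    fi
    sleep "${sleep_seconds}"
  done
  echo "Unable to resolve Railway environment ID for ${env_name}" >&2
  return 1
}
```

`railway_wait_for_service_id_for_env` would follow the same loop shape around the service lookup, which is why the PR describes both helpers as sharing one pattern.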

Why

Preview provisioning can fail immediately after cloning preview-base because Railway has created the environment but has not yet populated serviceInstances for the cloned services. In that state, railway_ensure_tcp_proxy() fails with Unable to resolve Railway service ID for <service> in pr-<n>.

This is a Railway timing issue, not a PR-specific logic problem. Retrying environment and service discovery makes the preview workflow resilient to Railway's eventual consistency while still failing quickly if discovery never succeeds.

Test Plan

  • bash -n .github/scripts/preview/*.sh
  • Parse .github/workflows/preview-environments.yml with yaml.safe_load
  • git diff --check origin/main...HEAD
  • Verify the fixed head passes Provision Tier 1 (Railway) on this PR
  • Verify the full preview workflow passes end to end on this PR
  • Verify pr-review, ci, and Cypress E2E Tests pass on this PR

Notes

This is intentionally scoped to the pre-existing Railway provisioning flake that surfaced while validating the stable preview URL comment change in #2799.

@changeset-bot

changeset-bot bot commented Mar 23, 2026

⚠️ No Changeset found

Latest commit: 29365d9

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@vercel

vercel bot commented Mar 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
agents-api | Ready | Preview, Comment | Mar 23, 2026 2:32pm
agents-docs | Ready | Preview, Comment | Mar 23, 2026 2:32pm
agents-manage-ui | Ready | Preview, Comment | Mar 23, 2026 2:32pm


@pullfrog
Contributor

pullfrog bot commented Mar 23, 2026

TL;DR — Adds retry-with-jitter loops around Railway preview environment and service ID lookups so that transient API delays no longer break TCP proxy setup during CI preview deployments, and ensures discovery failures propagate correctly via || return 1.

Key changes

  • Add railway_wait_for_environment_id and railway_wait_for_service_id_for_env retry wrappers — poll the Railway GraphQL API up to max_attempts times with jittered sleep, replacing the previous single-shot lookups that failed silently on empty responses.
  • Refactor railway_ensure_tcp_proxy to use the new retry helpers — removes inline null-check-and-bail logic in favor of the dedicated wait functions, with || return 1 to propagate failures.

Summary | 1 file | 2 commits | base: main | head: varun/preview-railway-service-discovery


Retry loop for Railway service discovery

Before: railway_ensure_tcp_proxy called railway_environment_id and railway_service_id_for_env once each, failing immediately if either returned empty — a common occurrence when Railway's API lags behind environment creation.
After: Two new helpers (railway_wait_for_environment_id, railway_wait_for_service_id_for_env) wrap those calls in a configurable retry loop (default 30 attempts, 2 s jittered sleep), and railway_ensure_tcp_proxy delegates to them with || return 1 to propagate failures.

Both helpers share the same pattern: loop up to max_attempts, call the underlying lookup, return early on success, and sleep_with_jitter between attempts. The existing max_attempts and sleep_seconds parameters from railway_ensure_tcp_proxy are forwarded through, so callers keep the same interface.

Why jittered sleep instead of fixed backoff? The script already defines a sleep_with_jitter utility used elsewhere in the file. Reusing it avoids thundering-herd effects when multiple preview services retry concurrently during the same CI run.
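`sleep_with_jitter` itself isn't shown in this diff; one plausible shape, assuming bash's `$RANDOM` and a sub-second jitter window (both assumptions, not taken from common.sh):

```shell
# Hypothetical sketch of a jittered sleep: base interval plus 0-999 ms of
# random jitter so concurrent retries across services de-synchronize.
sleep_with_jitter() {
  local base="${1:-2}"
  local jitter_ms=$((RANDOM % 1000))
  sleep "$(awk -v b="${base}" -v j="${jitter_ms}" 'BEGIN { printf "%.3f", b + j / 1000 }')"
}
```

With three preview services retrying in the same CI run, the jitter spreads their Railway API calls out instead of letting them poll in lockstep.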

.github/scripts/preview/common.sh

Pullfrog | View workflow run | Triggered by Pullfrog

Contributor

@pullfrog pullfrog bot left a comment


Clean change — the retry wrappers follow the same pattern as railway_extract_runtime_var and correctly address the Railway eventual-consistency flake. One observation about the retry budget worth considering.


Comment on lines +251 to +252
env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
Contributor


Nit (non-blocking): railway_ensure_tcp_proxy forwards its full max_attempts/sleep_seconds to each of the two new wait functions, then runs its own polling loop with the same budget. Worst-case wall time tripled from ~60 s to ~180 s (30 × 2 s × 3 phases) compared to the previous single-phase version.

This is probably fine in practice — the environment and service IDs should resolve quickly — but it's worth being aware of. If CI timeout pressure becomes a concern later, consider giving the discovery phases a smaller attempt budget (e.g., 10) while keeping the TCP-proxy polling at 30.

Contributor

@pullfrog pullfrog bot left a comment


Looks good — approving.


Contributor

@claude claude bot left a comment


PR Review Summary

(1) Total Issues | Risk: Medium

🟠⚠️ Major (1)

Inline Comments:

  • 🟠 Major: common.sh:251-252 Missing error propagation from wait functions — failures waste 60s retrying with invalid input

💭 Consider (4) 💭

💭 1) common.sh:196,221 Add progress logging during retry loops

Issue: When retry loops take up to 60 seconds, operators have no visibility into whether the script is progressing or stuck.

Why: Silent retry loops make incident debugging harder since there's no indication of which attempt failed or how long the wait has been.

Fix: Add progress logging to stderr after the first attempt:

if [ "${attempt}" -gt 1 ]; then
  echo "Waiting for environment ID (attempt ${attempt}/${max_attempts})..." >&2
fi

💭 2) common.sh:191,217 Default max_attempts differs from existing pattern

Issue: The new wait functions default to max_attempts=30 while the existing railway_extract_runtime_var uses max_attempts=20.

Why: This inconsistency could lead to different retry behavior expectations. 30 attempts × ~2s = ~60-90s max wait vs 20 × ~2s = ~40-60s for the existing function.

Fix: Consider aligning with the existing default (20) or documenting why service discovery needs more attempts.

💭 3) system Be aware of cascading retry latency

Issue: The railway_ensure_tcp_proxy function now chains two wait functions plus its own internal retry loop, resulting in worst-case latency of ~180 seconds.

Why: Environment wait (30 × 2s) + service wait (30 × 2s) + TCP proxy wait (30 × 2s) = 180s total. This is acceptable for CI but operators should be aware of the timeout budget.

Fix: Consider documenting the total timeout budget or adding a comment noting the worst-case latency.

💭 4) common.sh:191 Consider exponential backoff

Issue: The retry strategy uses fixed 2-second base interval with jitter rather than exponential backoff.

Why: Exponential backoff is the industry standard for retry patterns and would reduce load on Railway's API during extended outages. However, the current approach follows the existing railway_extract_runtime_var pattern in the file.

Fix: For consistency, keeping linear backoff is reasonable. If changing, consider: sleep_seconds * (2 ** min(attempt - 1, 4)) * jitter capped at 30s.
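The suggested formula can be sketched as follows; the function name, the attempt cap of 4 doublings, and the 30 s ceiling are illustrative, and the multiplicative jitter factor is left out for clarity:

```shell
# Hypothetical capped exponential backoff: delay = base * 2^(attempt-1),
# with the exponent capped at 4 and the result capped at a ceiling.
backoff_seconds() {
  local base="$1" attempt="$2" cap="${3:-30}"
  local exp=$((attempt - 1))
  [ "${exp}" -gt 4 ] && exp=4
  local delay=$((base * (1 << exp)))
  [ "${delay}" -gt "${cap}" ] && delay="${cap}"
  printf '%s\n' "${delay}"
}
```

With a 2 s base this yields 2, 4, 8, 16, then 30 s (capped) for every later attempt, versus a flat ~2 s under the current linear scheme.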


🚫 REQUEST CHANGES

Summary: The retry logic approach is sound and addresses the Railway eventual consistency issue. However, the removal of explicit error handling creates a regression: if environment lookup fails, the script wastes ~60 seconds retrying service lookup with an empty environment ID before failing with a confusing error. Adding || return 1 to both wait function calls (see inline comment) restores defensive error handling without relying on caller shell strictness.

Discarded (0)

No findings were discarded.

Reviewers (3)
Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded
pr-review-standards | 1 | 0 | 0 | 0 | 1 | 0 | 0
pr-review-sre | 5 | 0 | 3 | 0 | 1 | 0 | 1
pr-review-devops | 2 | 0 | 1 | 0 | 1 | 0 | 0
Total | 8 | 0 | 4 | 0 | 1 | 0 | 1

Note: Error propagation finding was raised by all 3 reviewers and consolidated into single inline comment. SRE's exponential backoff suggestion was kept as Consider since it conflicts with existing file patterns.

Comment on lines +251 to +252
env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
Copy link
Contributor


🟠 MAJOR: Missing error propagation from wait functions

Issue: If railway_wait_for_environment_id fails and returns exit code 1, the script continues with an empty env_id. The subsequent railway_wait_for_service_id_for_env call then retries 30 times with an empty environment ID before failing with a misleading error.

Why: Since this file is sourced and "callers own shell strictness" (per line 2), set -e cannot be assumed. The original code had explicit -z checks that prevented this cascading failure. Without explicit error handling, failures in the first lookup waste ~60 seconds retrying the second lookup with invalid input.

Fix: (1-click apply)

Suggested change
env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")" || return 1
service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")" || return 1
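The failure mode this suggestion guards against can be reproduced without Railway at all; `lookup_fail` below is a hypothetical stand-in for a failing wait helper:

```shell
# A stand-in for a wait helper that exhausts its retries and exits 1.
lookup_fail() { return 1; }

# Without the guard, the failed command substitution is ignored: env_id
# stays empty and execution falls through to the next step.
without_guard() {
  local env_id
  env_id="$(lookup_fail)"
  echo "continued with env_id='${env_id}'"
}

# With the guard, the assignment's exit status (that of the substitution)
# propagates immediately, matching the original fail-fast behavior.
with_guard() {
  local env_id
  env_id="$(lookup_fail)" || return 1
  echo "never reached"
}
```

This works because a plain assignment's exit status is that of its command substitution, so `|| return 1` fires even though the assignment itself "succeeds" syntactically; that is exactly why the guard is needed in a sourced file where `set -e` cannot be assumed.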


Contributor Author


Fixed in 54759a603 by restoring explicit error propagation on both wait-command substitutions with || return 1. That preserves the original fail-fast behavior when environment discovery fails.

@github-actions github-actions bot deleted a comment from claude bot Mar 23, 2026
Contributor

@claude claude bot left a comment


PR Review Summary

(0) Total Issues | Risk: Low

Re-review Delta Assessment

The previous review identified 1 MAJOR issue: missing error propagation from wait functions at lines 251-252. The author addressed this in commit 54759a603 by adding || return 1 to both wait function calls.

Verification:

# Before (missing error propagation)
env_id="$(railway_wait_for_environment_id ...)"
service_id="$(railway_wait_for_service_id_for_env ...)"

# After (correct error propagation)
env_id="$(railway_wait_for_environment_id ...)" || return 1
service_id="$(railway_wait_for_service_id_for_env ...)" || return 1

The fix is correct and complete:

  • If railway_wait_for_environment_id fails, the function now returns immediately with exit code 1
  • If railway_wait_for_service_id_for_env fails, the function returns immediately with exit code 1
  • No wasted retry cycles with empty/invalid input

🕐 Pending Recommendations (4)

The following suggestions from the previous review remain open for the author's consideration. These are non-blocking improvements:


✅ APPROVE

Summary: The blocking issue from the previous review has been correctly addressed. Error propagation now works as expected — failures in environment or service discovery will immediately abort railway_ensure_tcp_proxy rather than wasting ~60 seconds retrying with invalid input. The retry logic cleanly handles Railway's eventual consistency without introducing silent failures. Ship it! 🚀

Reviewers (0)
Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded
Total | 0 | 0 | 0 | 0 | 0 | 0 | 0

Note: No reviewers dispatched for this delta re-review. The change was a 2-line fix directly addressing the previous MAJOR finding. Manual verification confirmed correctness.

@github-actions github-actions bot deleted a comment from claude bot Mar 23, 2026
@itoqa

itoqa bot commented Mar 23, 2026

Ito Test Report ✅

15 test cases ran. 15 passed.

All 15 test cases passed with zero failures, indicating no likely production-code defects across preview-environment provisioning, workflow routing, diagnostics/error handling, UI auth navigation, and security-critical API/auth paths. Key validated behaviors: successful and idempotent Tier 1 provisioning with TCP proxies converging to ACTIVE (including partial-readiness, transient create-failure recovery, and rapid concurrent dispatches without race issues); correct closed-PR teardown-only gating; failure diagnostics that redact secrets while preserving actionable context; explicit missing-service errors that block downstream jobs; and consistent rejection of unauthorized access patterns (wrong passwords, tampered/stale cookies, cross-tenant path tampering, and origin-mismatched sign-in). Mobile deep-link and refresh/back-forward flows maintained coherent protected-content access rules throughout.

✅ Passed (15)
Category | Summary | Screenshot
Adversarial | Three rapid concurrent provisioning simulations completed without race-induced fatal lookup or duplicate-proxy errors. | ADV-1
Adversarial | Wrong-password sign-in attempts were denied and unauthenticated manage access stayed unauthorized. | ADV-2
Adversarial | Valid local sign-in issued a cookie, while tampered and stale cookies were rejected without data leakage. | ADV-3
Adversarial | Valid tenant access succeeded and cross-tenant path tampering was forbidden. | ADV-4
Adversarial | Origin-mismatched sign-in was rejected with no usable privileged session. | ADV-5
Adversarial | After abusive refresh/back/forward during active login transition, app settled to a coherent unauthenticated state and protected API returned 401, confirming no partial authorization state. | ADV-6
Edge | Provisioning simulation converged with all three TCP proxies ready and no terminal unresolved environment/service ID failures. | EDGE-4
Edge | Confirmed missing service resolution fails with an explicit unresolved-service error including service/env context, and dependent jobs remain gated behind provision-tier1 success. | EDGE-5
Edge | Transient create-environment failure recovery path re-checks existence and continues without a product defect. | EDGE-6
Edge | On mobile viewport, protected deep-link redirected to login pre-auth; after valid login, refresh/back-forward navigation remained stable and protected projects rendered only when authenticated. | EDGE-7
Logic | TCP proxy readiness polling converged to ACTIVE for provisioned services in inspected successful runs. | LOGIC-3
Logic | Triggered a safe smoke failure and confirmed diagnostics preserved actionable status details while redacting credentials, tokens, DB URLs, and session cookie values. | LOGIC-4
Happy-path | Manual dispatch flow reached Provision Tier 1 through smoke checks without a product-code defect signal. | ROUTE-1
Happy-path | Existing PR environment rerun behavior remained stable and idempotent in the verified workflow path. | ROUTE-2
Happy-path | Closed-PR routing in the preview workflow is correctly constrained to teardown-only execution gates. | ROUTE-3

Commit: 54759a6

View Full Run



