
ci: retry Railway preview service discovery#2800

Open
vnv-varun wants to merge 2 commits into main from varun/preview-railway-service-discovery

Conversation

@vnv-varun
Contributor

@vnv-varun vnv-varun commented Mar 23, 2026

Summary

Retry Railway environment and service ID discovery when provisioning per-PR preview infrastructure so preview setup does not fail on freshly cloned Railway environments.

Changes

  • add railway_wait_for_environment_id() in .github/scripts/preview/common.sh
  • add railway_wait_for_service_id_for_env() in .github/scripts/preview/common.sh
  • update railway_ensure_tcp_proxy() to wait for Railway to materialize cloned service instances before creating TCP proxies
  • preserve fail-fast behavior by propagating wait-function failures with || return 1
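The bodies of the two wait helpers aren't quoted in this thread; a minimal sketch of the shared retry pattern, assuming the existing single-shot lookup `railway_environment_id` from common.sh and using a plain `sleep` in place of the file's `sleep_with_jitter` helper:

```shell
# Sketch only: railway_environment_id is the pre-existing single-shot
# lookup; its signature here is an assumption based on this PR's description.
railway_wait_for_environment_id() {
  local project_id="$1" env_name="$2"
  local max_attempts="${3:-30}" sleep_seconds="${4:-2}"
  local attempt env_id
  for attempt in $(seq 1 "${max_attempts}"); do
    # Tolerate a failing lookup: treat it the same as an empty result.
    env_id="$(railway_environment_id "${project_id}" "${env_name}")" || env_id=""
    if [ -n "${env_id}" ]; then
      printf '%s\n' "${env_id}"
      return 0
    fi
    sleep "${sleep_seconds}"
  done
  echo "Unable to resolve Railway environment ID for ${env_name}" >&2
  return 1
}
```

`railway_wait_for_service_id_for_env` would follow the same loop shape around the service lookup, which is why the PR describes both helpers as sharing one pattern.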

Why

Preview provisioning can fail immediately after cloning preview-base because Railway has created the environment but has not yet populated serviceInstances for the cloned services. In that state, railway_ensure_tcp_proxy() fails with Unable to resolve Railway service ID for <service> in pr-<n>.

This is a Railway timing issue, not a PR-specific logic problem. Retrying environment and service discovery makes the preview workflow resilient to Railway's eventual consistency while still failing quickly if discovery never succeeds.

Test Plan

  • bash -n .github/scripts/preview/*.sh
  • Parse .github/workflows/preview-environments.yml with yaml.safe_load
  • git diff --check origin/main...HEAD
  • Verify the fixed head passes Provision Tier 1 (Railway) on this PR
  • Verify the full preview workflow passes end to end on this PR
  • Verify pr-review, ci, and Cypress E2E Tests pass on this PR

Notes

This is intentionally scoped to the pre-existing Railway provisioning flake that surfaced while validating the stable preview URL comment change in #2799.

@changeset-bot

changeset-bot bot commented Mar 23, 2026

⚠️ No Changeset found

Latest commit: 29365d9

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@vercel

vercel bot commented Mar 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
agents-api | Ready | Preview, Comment | Mar 23, 2026 2:32pm
agents-docs | Ready | Preview, Comment | Mar 23, 2026 2:32pm
agents-manage-ui | Ready | Preview, Comment | Mar 23, 2026 2:32pm


@pullfrog
Contributor

pullfrog bot commented Mar 23, 2026

TL;DR — Adds retry-with-jitter loops around Railway preview environment and service ID lookups so that transient API delays no longer break TCP proxy setup during CI preview deployments, and ensures discovery failures propagate correctly via || return 1.

Key changes

  • Add railway_wait_for_environment_id and railway_wait_for_service_id_for_env retry wrappers — poll the Railway GraphQL API up to max_attempts times with jittered sleep, replacing the previous single-shot lookups that failed silently on empty responses.
  • Refactor railway_ensure_tcp_proxy to use the new retry helpers — removes inline null-check-and-bail logic in favor of the dedicated wait functions, with || return 1 to propagate failures.

Summary | 1 file | 2 commits | base: main | head: varun/preview-railway-service-discovery


Retry loop for Railway service discovery

Before: railway_ensure_tcp_proxy called railway_environment_id and railway_service_id_for_env once each, failing immediately if either returned empty — a common occurrence when Railway's API lags behind environment creation.
After: Two new helpers (railway_wait_for_environment_id, railway_wait_for_service_id_for_env) wrap those calls in a configurable retry loop (default 30 attempts, 2 s jittered sleep), and railway_ensure_tcp_proxy delegates to them with || return 1 to propagate failures.

Both helpers share the same pattern: loop up to max_attempts, call the underlying lookup, return early on success, and sleep_with_jitter between attempts. The existing max_attempts and sleep_seconds parameters from railway_ensure_tcp_proxy are forwarded through, so callers keep the same interface.

Why jittered sleep instead of fixed backoff? The script already defines a sleep_with_jitter utility used elsewhere in the file. Reusing it avoids thundering-herd effects when multiple preview services retry concurrently during the same CI run.
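`sleep_with_jitter` itself isn't shown in this diff; one plausible shape, assuming bash's `$RANDOM` and a sub-second jitter window (both assumptions, not taken from common.sh):

```shell
# Hypothetical sketch of a jittered sleep: base interval plus 0-999 ms of
# random jitter so concurrent retries across services de-synchronize.
sleep_with_jitter() {
  local base="${1:-2}"
  local jitter_ms=$((RANDOM % 1000))
  sleep "$(awk -v b="${base}" -v j="${jitter_ms}" 'BEGIN { printf "%.3f", b + j / 1000 }')"
}
```

With three preview services retrying in the same CI run, the jitter spreads their Railway API calls out instead of letting them poll in lockstep.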

.github/scripts/preview/common.sh

Pullfrog | View workflow run | Triggered by Pullfrog

Contributor

@pullfrog pullfrog bot left a comment


Clean change — the retry wrappers follow the same pattern as railway_extract_runtime_var and correctly address the Railway eventual-consistency flake. One observation about the retry budget worth considering.


Comment on lines +251 to +252
env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
Contributor


Nit (non-blocking): railway_ensure_tcp_proxy forwards its full max_attempts/sleep_seconds to each of the two new wait functions, then runs its own polling loop with the same budget. Worst-case wall time tripled from ~60 s to ~180 s (30 × 2 s × 3 phases) compared to the previous single-phase version.

This is probably fine in practice — the environment and service IDs should resolve quickly — but it's worth being aware of. If CI timeout pressure becomes a concern later, consider giving the discovery phases a smaller attempt budget (e.g., 10) while keeping the TCP-proxy polling at 30.

Contributor

@pullfrog pullfrog bot left a comment


Looks good — approving.


Contributor

@claude claude bot left a comment


PR Review Summary

(1) Total Issues | Risk: Medium

🟠⚠️ Major (1)

Inline Comments:

  • 🟠 Major: common.sh:251-252 Missing error propagation from wait functions — failures waste 60s retrying with invalid input

💭 Consider (4) 💭

💭 1) common.sh:196,221 Add progress logging during retry loops

Issue: When retry loops take up to 60 seconds, operators have no visibility into whether the script is progressing or stuck.

Why: Silent retry loops make incident debugging harder since there's no indication of which attempt failed or how long the wait has been.

Fix: Add progress logging to stderr after the first attempt:

if [ "${attempt}" -gt 1 ]; then
  echo "Waiting for environment ID (attempt ${attempt}/${max_attempts})..." >&2
fi

💭 2) common.sh:191,217 Default max_attempts differs from existing pattern

Issue: The new wait functions default to max_attempts=30 while the existing railway_extract_runtime_var uses max_attempts=20.

Why: This inconsistency could lead to different retry behavior expectations. 30 attempts × ~2s = ~60-90s max wait vs 20 × ~2s = ~40-60s for the existing function.

Fix: Consider aligning with the existing default (20) or documenting why service discovery needs more attempts.

💭 3) system Be aware of cascading retry latency

Issue: The railway_ensure_tcp_proxy function now chains two wait functions plus its own internal retry loop, resulting in worst-case latency of ~180 seconds.

Why: Environment wait (30 × 2s) + service wait (30 × 2s) + TCP proxy wait (30 × 2s) = 180s total. This is acceptable for CI but operators should be aware of the timeout budget.

Fix: Consider documenting the total timeout budget or adding a comment noting the worst-case latency.

💭 4) common.sh:191 Consider exponential backoff

Issue: The retry strategy uses fixed 2-second base interval with jitter rather than exponential backoff.

Why: Exponential backoff is the industry standard for retry patterns and would reduce load on Railway's API during extended outages. However, the current approach follows the existing railway_extract_runtime_var pattern in the file.

Fix: For consistency, keeping linear backoff is reasonable. If changing, consider: sleep_seconds * (2 ** min(attempt - 1, 4)) * jitter capped at 30s.
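The suggested formula can be sketched as follows; the function name, the attempt cap of 4 doublings, and the 30 s ceiling are illustrative, and the multiplicative jitter factor is left out for clarity:

```shell
# Hypothetical capped exponential backoff: delay = base * 2^(attempt-1),
# with the exponent capped at 4 and the result capped at a ceiling.
backoff_seconds() {
  local base="$1" attempt="$2" cap="${3:-30}"
  local exp=$((attempt - 1))
  [ "${exp}" -gt 4 ] && exp=4
  local delay=$((base * (1 << exp)))
  [ "${delay}" -gt "${cap}" ] && delay="${cap}"
  printf '%s\n' "${delay}"
}
```

With a 2 s base this yields 2, 4, 8, 16, then 30 s (capped) for every later attempt, versus a flat ~2 s under the current linear scheme.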


🚫 REQUEST CHANGES

Summary: The retry logic approach is sound and addresses the Railway eventual consistency issue. However, the removal of explicit error handling creates a regression: if environment lookup fails, the script wastes ~60 seconds retrying service lookup with an empty environment ID before failing with a confusing error. Adding || return 1 to both wait function calls (see inline comment) restores defensive error handling without relying on caller shell strictness.

Discarded (0)

No findings were discarded.

Reviewers (3)
Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded
pr-review-standards | 1 | 0 | 0 | 0 | 1 | 0 | 0
pr-review-sre | 5 | 0 | 3 | 0 | 1 | 0 | 1
pr-review-devops | 2 | 0 | 1 | 0 | 1 | 0 | 0
Total | 8 | 0 | 4 | 0 | 1 | 0 | 1

Note: Error propagation finding was raised by all 3 reviewers and consolidated into single inline comment. SRE's exponential backoff suggestion was kept as Consider since it conflicts with existing file patterns.

Comment on lines +251 to +252
env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
Copy link
Contributor


🟠 MAJOR: Missing error propagation from wait functions

Issue: If railway_wait_for_environment_id fails and returns exit code 1, the script continues with an empty env_id. The subsequent railway_wait_for_service_id_for_env call then retries 30 times with an empty environment ID before failing with a misleading error.

Why: Since this file is sourced and "callers own shell strictness" (per line 2), set -e cannot be assumed. The original code had explicit -z checks that prevented this cascading failure. Without explicit error handling, failures in the first lookup waste ~60 seconds retrying the second lookup with invalid input.

Fix: (1-click apply)

Suggested change
env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")" || return 1
service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")" || return 1
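The failure mode this suggestion guards against can be reproduced without Railway at all; `lookup_fail` below is a hypothetical stand-in for a failing wait helper:

```shell
# A stand-in for a wait helper that exhausts its retries and exits 1.
lookup_fail() { return 1; }

# Without the guard, the failed command substitution is ignored: env_id
# stays empty and execution falls through to the next step.
without_guard() {
  local env_id
  env_id="$(lookup_fail)"
  echo "continued with env_id='${env_id}'"
}

# With the guard, the assignment's exit status (that of the substitution)
# propagates immediately, matching the original fail-fast behavior.
with_guard() {
  local env_id
  env_id="$(lookup_fail)" || return 1
  echo "never reached"
}
```

This works because a plain assignment's exit status is that of its command substitution, so `|| return 1` fires even though the assignment itself "succeeds" syntactically; that is exactly why the guard is needed in a sourced file where `set -e` cannot be assumed.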


Contributor Author


Fixed in 54759a603 by restoring explicit error propagation on both wait-command substitutions with || return 1. That preserves the original fail-fast behavior when environment discovery fails.

@github-actions github-actions bot deleted a comment from claude bot Mar 23, 2026
Contributor

@claude claude bot left a comment


PR Review Summary

(0) Total Issues | Risk: Low

Re-review Delta Assessment

The previous review identified 1 MAJOR issue: missing error propagation from wait functions at lines 251-252. The author addressed this in commit 54759a603 by adding || return 1 to both wait function calls.

Verification:

# Before (missing error propagation)
env_id="$(railway_wait_for_environment_id ...)"
service_id="$(railway_wait_for_service_id_for_env ...)"

# After (correct error propagation)
env_id="$(railway_wait_for_environment_id ...)" || return 1
service_id="$(railway_wait_for_service_id_for_env ...)" || return 1

The fix is correct and complete:

  • If railway_wait_for_environment_id fails, the function now returns immediately with exit code 1
  • If railway_wait_for_service_id_for_env fails, the function returns immediately with exit code 1
  • No wasted retry cycles with empty/invalid input

🕐 Pending Recommendations (4)

The following suggestions from the previous review remain open for the author's consideration. These are non-blocking improvements:


✅ APPROVE

Summary: The blocking issue from the previous review has been correctly addressed. Error propagation now works as expected — failures in environment or service discovery will immediately abort railway_ensure_tcp_proxy rather than wasting ~60 seconds retrying with invalid input. The retry logic cleanly handles Railway's eventual consistency without introducing silent failures. Ship it! 🚀

Reviewers (0)
Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded
Total | 0 | 0 | 0 | 0 | 0 | 0 | 0

Note: No reviewers dispatched for this delta re-review. The change was a 2-line fix directly addressing the previous MAJOR finding. Manual verification confirmed correctness.

@github-actions github-actions bot deleted a comment from claude bot Mar 23, 2026
@itoqa

itoqa bot commented Mar 23, 2026

Ito Test Report ✅

15 test cases ran. 15 passed.

All 15 test cases passed with zero failures, indicating no likely production-code defects across preview-environment provisioning, workflow routing, diagnostics/error handling, UI auth navigation, and security-critical API/auth paths. Key validated behaviors: successful and idempotent Tier 1 provisioning with TCP proxies converging to ACTIVE (including partial-readiness, transient create-failure recovery, and rapid concurrent dispatches without race issues); correct closed-PR teardown-only gating; failure diagnostics that redact secrets while preserving actionable context; explicit missing-service errors that block downstream jobs; and consistent rejection of unauthorized access patterns (wrong passwords, tampered/stale cookies, cross-tenant path tampering, and origin-mismatched sign-in). Mobile deep-link and refresh/back-forward flows maintained coherent protected-content access rules throughout.

✅ Passed (15)
Category | Summary | Screenshot
Adversarial | Three rapid concurrent provisioning simulations completed without race-induced fatal lookup or duplicate-proxy errors. | ADV-1
Adversarial | Wrong-password sign-in attempts were denied and unauthenticated manage access stayed unauthorized. | ADV-2
Adversarial | Valid local sign-in issued a cookie, while tampered and stale cookies were rejected without data leakage. | ADV-3
Adversarial | Valid tenant access succeeded and cross-tenant path tampering was forbidden. | ADV-4
Adversarial | Origin-mismatched sign-in was rejected with no usable privileged session. | ADV-5
Adversarial | After abusive refresh/back/forward during active login transition, app settled to a coherent unauthenticated state and protected API returned 401, confirming no partial authorization state. | ADV-6
Edge | Provisioning simulation converged with all three TCP proxies ready and no terminal unresolved environment/service ID failures. | EDGE-4
Edge | Confirmed missing service resolution fails with an explicit unresolved-service error including service/env context, and dependent jobs remain gated behind provision-tier1 success. | EDGE-5
Edge | Transient create-environment failure recovery path re-checks existence and continues without a product defect. | EDGE-6
Edge | On mobile viewport, protected deep-link redirected to login pre-auth; after valid login, refresh/back-forward navigation remained stable and protected projects rendered only when authenticated. | EDGE-7
Logic | TCP proxy readiness polling converged to ACTIVE for provisioned services in inspected successful runs. | LOGIC-3
Logic | Triggered a safe smoke failure and confirmed diagnostics preserved actionable status details while redacting credentials, tokens, DB URLs, and session cookie values. | LOGIC-4
Happy-path | Manual dispatch flow reached Provision Tier 1 through smoke checks without a product-code defect signal. | ROUTE-1
Happy-path | Existing PR environment rerun behavior remained stable and idempotent in the verified workflow path. | ROUTE-2
Happy-path | Closed-PR routing in the preview workflow is correctly constrained to teardown-only execution gates. | ROUTE-3

Commit: 54759a6

View Full Run



