chore(ci): Work on improving flaky Windows playwright smoke tests #1756
Merged
5 commits
97811db chore(ci): Work on improving flaky Windows playwright smoke tests (Tobbe)
26664f5 Wholesale switch to 127.0.0.1 (Tobbe)
42bd468 fix more localhost vs 127.0.0.1 (Tobbe)
c2a63af new investigation update (Tobbe)
cae417f dbAuth secret gen (Tobbe)
124 changes: 124 additions & 0 deletions
docs/implementation-plans/flaky-windows-smoke-tests-investigation.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,124 @@ | ||
| # Investigation: Flaky Windows Smoke Tests (`ERR_CONNECTION_REFUSED`) | ||
|
|
||
| ## Summary | ||
|
|
||
| Multiple smoke test suites (`serve`, `fragments-serve`, `prerender`) fail | ||
| intermittently on Windows CI runners with `net::ERR_CONNECTION_REFUSED` when | ||
| trying to reach `http://localhost:8910`. The failure is not deterministic — | ||
| sometimes the suite passes entirely, sometimes it fails partway through. | ||
|
|
||
| This document tracks the investigation and applied mitigations. | ||
|
|
||
| ## Observed Failure Pattern | ||
|
|
||
| From run | ||
| [25702624909](https://github.com/cedarjs/cedar/actions/runs/25702624909/job/75468866374) | ||
| (PR #1754, Windows fragments smoke tests): | ||
|
|
||
| - 10 tests total: 3 passed, 7 failed on every retry | ||
| - The 3 passing tests use `noJsBrowser` to navigate to pre-rendered pages | ||
| - In the failing tests, every `page.goto()` call fails with `ERR_CONNECTION_REFUSED` | ||
| - `page.waitForResponse()` times out after 30 seconds | ||
| - All 3 retries per test produce the same error; the server does not recover | ||
|
|
||
| ### Chronology from CI Logs | ||
|
|
||
| | Time (UTC) | Event | | ||
| | ---------- | ------------------------------------------------------ | | ||
| | 23:40:48 | First API server starts on port 8911 | | ||
| | 23:41:05 | `yarn cedar build` prerendering begins | | ||
| | 23:41:09 | **Apollo GraphQL error** during prerender of `/double` | | ||
| | 23:41:18 | `yarn cedar serve` starts for tests (PID 6460) | | ||
| | 23:41:19 | Web server reports listening on `127.0.0.1:8910` | | ||
| | 23:42:08 | Server **restarts** — new process (PID 5924) | | ||
| | 23:42:09 | Web server listening again | | ||
| | 23:44:32 | All remaining tests fail with `ERR_CONNECTION_REFUSED` | | ||
|
|
||
| ### Apollo Error During Build | ||
|
|
||
| ``` | ||
| An error occurred! For more details, see the full error text at | ||
| https://go.apollo.dev/c/err#%7B%22version%22%3A%223.14.1%22%2C%22message%22%3A17%2C%22args%22%3A%5B%5D%7D | ||
| ❯ Prerendering /double -> web/dist/double.html | ||
| ✔ Prerendering /double -> web/dist/double.html | ||
| ``` | ||
|
|
||
| The error occurs during `yarn cedar build` (prerender phase) when fetching | ||
| GraphQL data for the `/double` page. Despite the error, the prerender step | ||
| completes and produces `web/dist/double.html`. The `"message":17` field in the | ||
| error URL is the numeric code of an internal Apollo Client error, reported by | ||
| client version 3.14.1. | ||
|
|
||
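The payload behind the `go.apollo.dev` link can be read without opening it, because the part after `#` is URL-encoded JSON. The following is a minimal sketch of that decoding step. The `decodeApolloErrorUrl` helper and the `ApolloErrorPayload` type are names made up for this illustration and are not part of Apollo Client.

```typescript
// URL copied verbatim from the CI log excerpt above.
const errorUrl =
  'https://go.apollo.dev/c/err#%7B%22version%22%3A%223.14.1%22%2C%22message%22%3A17%2C%22args%22%3A%5B%5D%7D'

// Hypothetical type describing the fields visible in this particular URL.
interface ApolloErrorPayload {
  version: string
  message: number
  args: unknown[]
}

// Hypothetical helper: takes everything after '#', URL-decodes it, parses JSON.
function decodeApolloErrorUrl(url: string): ApolloErrorPayload {
  const fragment = url.slice(url.indexOf('#') + 1)
  return JSON.parse(decodeURIComponent(fragment)) as ApolloErrorPayload
}

const payload = decodeApolloErrorUrl(errorUrl)
console.log(payload)
// { version: '3.14.1', message: 17, args: [] }
```

The decoded payload only confirms the version and the numeric code. What code 17 means still has to be looked up against Apollo Client 3.14.1.
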
| ## Potential Root Causes | ||
|
|
||
| ### 1. `localhost` DNS Resolution (Mitigated) | ||
|
|
||
| On Windows, `localhost` can resolve to `::1` (IPv6) instead of `127.0.0.1` | ||
| (IPv4). If the web server binds to IPv4 only, connections via `localhost` fail. | ||
| The Playwright configs used `localhost` in both `webServer.url` and | ||
| `use.baseURL`. | ||
|
|
||
| **Mitigation:** Changed all three configs to use `127.0.0.1` directly. | ||
|
|
||
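To check how `localhost` resolves on a given runner, a small Node script along the following lines could be run as a CI step. This is a sketch and not part of the PR. `firstIsIPv6` and `ResolvedAddress` are illustrative names, and the order of the returned addresses depends on the OS resolver and on Node's DNS result-order setting.

```typescript
import { lookup } from 'node:dns/promises'

// Minimal local type for the entries returned by lookup(..., { all: true }).
interface ResolvedAddress {
  address: string
  family: number
}

// Hypothetical helper: true when the first resolved address is IPv6.
function firstIsIPv6(addresses: ResolvedAddress[]): boolean {
  return addresses.length > 0 && addresses[0].family === 6
}

async function reportLocalhostResolution(): Promise<void> {
  // { all: true } returns every address instead of only the first one.
  const addresses = await lookup('localhost', { all: true })
  for (const { address, family } of addresses) {
    console.log(`localhost -> ${address} (IPv${family})`)
  }
  if (firstIsIPv6(addresses)) {
    console.log('localhost resolves to IPv6 first; an IPv4-only server would refuse it')
  }
}

reportLocalhostResolution().catch((error) => {
  console.error('lookup failed:', error)
})
```

If the script lists `::1` first on the Windows runner, that would support this root cause. If it lists `127.0.0.1` first, the mitigation is harmless but unlikely to be the whole story.
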
| ### 2. Insufficient Server Startup Timeout (Mitigated) | ||
|
|
||
| Playwright's `webServer` checks the URL to determine readiness, but the default | ||
| 60-second timeout may be too tight on Windows CI runners where `yarn cedar serve` | ||
| performs a lot of work (compiling, generating, starting both API and web | ||
| servers). | ||
|
|
||
| **Mitigation:** Increased `webServer.timeout` to 120 seconds on CI. | ||
|
|
||
| ### 3. Server Crash During Test Execution (Unresolved) | ||
|
|
||
| The server restart at 23:42:08 (different PID) indicates the original process | ||
| died. Possible causes: | ||
|
|
||
| - A specific page (`/double` with the Apollo error) triggers an unhandled | ||
| exception in the web server | ||
| - Memory pressure on the CI runner causes the OS to kill the process | ||
| - The prerender build step leaves port 8910 in a bad state | ||
|
|
||
| This needs further investigation — reproduce the crash locally with the same | ||
| build artifact and test page. | ||
|
|
||
| ### 4. Apollo Error in Prerender Build (Needs Investigation) | ||
|
|
||
| The Apollo error during `/double` prerender may produce a broken HTML file that | ||
| the server then crashes trying to serve or that triggers a crash during client | ||
| rehydration. The smoke test for `/double` (`Check that rehydration works for | ||
| page not wrapped in Set`) uses `expect(errors).toMatchObject([])` which | ||
| silently ignores errors. | ||
|
|
||
| ## Applied Mitigations | ||
|
|
||
| ### Changed Files | ||
|
|
||
| - `tasks/smoke-tests/serve/playwright.config.ts` | ||
| - `tasks/smoke-tests/fragments-serve/playwright.config.ts` | ||
| - `tasks/smoke-tests/prerender/playwright.config.ts` | ||
|
|
||
| ### Changes | ||
|
|
||
| 1. `localhost` → `127.0.0.1` in both `baseURL` and `webServer.url` | ||
| 2. Added `timeout: process.env.CI ? 120_000 : 60_000` to `webServer` | ||
|
|
||
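Taken together, the two changes amount to roughly the following settings. This is a sketch under assumptions, not the actual config files. It builds a plain object so that it runs without `@playwright/test` installed. `buildServerSettings` and `SmokeTestServerSettings` are illustrative names, and the `command` value is a guess based on the CI log.

```typescript
// Assumed shape: only the fields discussed in this document are modelled.
interface SmokeTestServerSettings {
  use: { baseURL: string }
  webServer: { command: string; url: string; timeout: number }
}

// 127.0.0.1 avoids depending on how `localhost` resolves on the runner.
const WEB_URL = 'http://127.0.0.1:8910'

// Hypothetical helper returning the settings for CI or local runs.
function buildServerSettings(isCI: boolean): SmokeTestServerSettings {
  return {
    use: { baseURL: WEB_URL },
    webServer: {
      command: 'yarn cedar serve', // assumption: the real command may differ
      url: WEB_URL,
      // Slower CI runners get twice the 60-second default.
      timeout: isCI ? 120_000 : 60_000,
    },
  }
}

const settings = buildServerSettings(Boolean(process.env.CI))
console.log(settings.webServer.timeout)
```

In a real `playwright.config.ts`, an object of this shape would be passed to `defineConfig` from `@playwright/test`. Keeping `baseURL` and `webServer.url` derived from one constant makes it harder for the two to drift apart again.
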
| ## Open Questions | ||
|
|
||
| - What is Apollo Client error code 17, and why does it occur during `/double` | ||
| prerender? | ||
| - Does the broken `/double` prerendered HTML cause the server to crash at | ||
| runtime? | ||
| - Why does the server restart once (PID change) but then die permanently? | ||
| - Can we add a health-check endpoint to `cedar serve` so Playwright can verify | ||
| the server is truly ready before tests start? | ||
| - Should the `expect(errors).toMatchObject([])` pattern be replaced with an | ||
| explicit `expect(errors).toEqual([])` to catch hidden errors? | ||
|
|
||
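Until `cedar serve` has a health-check endpoint, a client-side TCP probe is one possible way to tell "server is down" apart from "test is broken". The sketch below is not an existing Cedar or Playwright API. `canConnect` and `waitForPort` are hypothetical helpers built on Node's `net` module, and they only verify that the port accepts connections, not that pages render.

```typescript
import { connect } from 'node:net'

// Resolves true if a TCP connection to host:port succeeds within timeoutMs.
function canConnect(host: string, port: number, timeoutMs = 1_000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = connect({ host, port })
    const finish = (ok: boolean) => {
      socket.destroy()
      resolve(ok) // later calls are no-ops once the promise has settled
    }
    socket.setTimeout(timeoutMs, () => finish(false))
    socket.once('connect', () => finish(true))
    socket.once('error', () => finish(false))
  })
}

// Polls the port up to `attempts` times, waiting delayMs between tries.
async function waitForPort(
  host: string,
  port: number,
  attempts = 30,
  delayMs = 1_000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    if (await canConnect(host, port)) {
      return true
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs))
  }
  return false
}
```

Calling `waitForPort('127.0.0.1', 8910)` from a test hook and logging the result on failure would show whether the web server was still listening when `ERR_CONNECTION_REFUSED` appeared.
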
| ## Next Steps | ||
|
|
||
| 1. Monitor CI runs after the `127.0.0.1` and timeout mitigations are merged | ||
| 2. If flakiness persists, investigate the Apollo error during `/double` | ||
| prerender | ||
| 3. Consider replacing `toMatchObject([])` with explicit error assertions in | ||
| rehydration tests | ||
| 4. If confirmed fixed, promote this document to `docs/implementation-docs/` | ||