Skip to content
This repository has been archived by the owner on Nov 6, 2019. It is now read-only.

Edge consistently times out when collecting from specific WPT segment #644

Open
jugglinmike opened this issue Jan 29, 2019 · 4 comments
Open
Assignees

Comments

@jugglinmike
Copy link
Collaborator

When running tests in Edge via Sauce Labs, we currently divide WPT into 100 segments (referred to by the WPT infrastructure as "chunks"). On 2018-12-26, the collection from chunk 10 of 100 appeared to complete successfully, but the Buildbot master did not detect completion, and after 20 minutes, the job timed out. This has occurred for that chunk with regularity in the weeks that have passed since.

Example of logs (from 2019-01-28)
2019-01-28 14:05:26,345 INFO validate-wpt-results wpt-run:stdout Testing ed53b0478faeb87abd71cd9972d4e2b7a6103241 == a7c05b88682fd5c1adb58984bb0a9edf46daee95
2019-01-28 14:05:26,345 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Got signal terminated
2019-01-28 14:05:26,442 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Cleaning up.
2019-01-28 14:05:26,442 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Removing tunnel 9f7ba991b73046a593aab88fbfebe745.
2019-01-28 14:05:26,443 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Waiting for any active jobs using this tunnel to finish.
2019-01-28 14:05:26,443 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Press CTRL-C again to shut down immediately.
2019-01-28 14:05:26,443 INFO validate-wpt-results wpt-run:stdout 28 Jan 14:05:26 - Note: if you do this, tests that are still running will fail.
2019-01-28 14:05:31,267 INFO validate-wpt-results wpt-run:stdout 108:10.77 INFO Closing logging queue
2019-01-28 14:05:31,270 INFO validate-wpt-results wpt-run:stdout 108:10.77 INFO queue closed
2019-01-28 14:05:31,577 INFO validate-wpt-results WPT CLI exited with return code 1
2019-01-28 14:05:31,632 INFO validate-wpt-results Expected 1229 results
2019-01-28 14:05:31,633 INFO validate-wpt-results Found 1229 results
2019-01-28 14:05:31,633 INFO validate-wpt-results Found 0 unexpected results
2019-01-28 14:05:31,633 INFO validate-wpt-results Found 0 missing results

command timed out: 1200 seconds without output running ['run-and-verify.py', '--max-attempts', '3', '--log-wptreport', '/tmp/tmpXvRNyX/report.json', '--log-raw', '/tmp/tmpXvRNyX/log-raw.txt', '--', '--log-mach', '-', '--this-chunk', '10', '--total-chunks', '100', '--sauce-platform', 'windows 10', '--sauce-user', 'wpt-shiek', '--sauce-key', '<sauce_labs_key_shiek>', '--sauce-tunnel-id', 'shiek', '--sauce-connect-binary', 'sc', '--sauce-connect-arg=--logfile=/var/log/sauce-connect/sc.log', '--sauce-init-timeout', '45', '--no-restart-on-unexpected', '--run-by-dir', '3', 'sauce:MicrosoftEdge:17'], attempting to kill
SIGKILL failed to kill process
using fake rc=-1
program finished with exit code -1

remoteFailed: [Failure instance: Traceback from remote host -- exceptions.RuntimeError: SIGKILL failed to kill process
]

Between Buildbot, the "retry" wrapper script, the WPT CLI, and the Sauce Connect binary, the process tree is pretty deep for this code. My initial thought is that some thread is active and causing one of these processes to not report as complete, though the message "WPT CLI exited with return code 1" suggests this is happening at a pretty high level.

@jugglinmike jugglinmike self-assigned this Jan 29, 2019
@jugglinmike
Copy link
Collaborator Author

I haven't been able to find a conclusive explanation for this behavior, nor have I been able to reproduce it under more controlled circumstances (running "chunk" number 10 in Edge takes about 2 hours).

That said, I did come across a warning in the Python documentation which is related to the APIs we're using here. It doesn't exactly describe the problem we're seeing because our logs include the message "WPT CLI exited with return code 1" which is only emitted after the dangerous function invocation.

Still, if the pattern is known to be susceptible to deadlocking, then we ought to avoid it. I've pushed up a commit to do just that. It's possible that this will also address our issue. While I'd certainly appreciate a more conclusive resolution, we need to time box investigation on this latest glitch.

@jugglinmike
Copy link
Collaborator Author

Two unrelated issues prevented me from validating the commit reference above:

@jugglinmike
Copy link
Collaborator Author

Following a couple of successes, we failed to collect from the segment in question for the very same reason. I'll continue my effort to reproduce the bug locally and find the true fix.

@jugglinmike
Copy link
Collaborator Author

When this issue is expressed, the Sauce Labs logs are conspicuously short. They abruptly end when the browser is accessing resources for the tests at html/infrastructure/urls/resolving-urls/query-encoding. It's odd that the WPT logs are complete in these cases, describing the results for all subsequent tests.

My working theory is that the Sauce Connect binary is falling into an uninterruptable sleep.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant