PBS: detection of qstat failures no longer working #6531

Open
dpmatthews opened this issue Dec 19, 2024 · 1 comment
@dpmatthews
Contributor

#2691 added code to detect qstat failures by searching for "Connection refused" in stderr. However, this is not working on our new system, which results in jobs being incorrectly reported as failed when polled.

Information at the time indicated we could expect to see errors like this if qstat failed to contact the server:

Connection refused
qstat: cannot connect to server xxxxxx (errno=111)

However, we are now seeing errors like this from PBS 2022.1.7:

Connection timed out
qstat: cannot connect to server xxxxxx (errno=xxxxx)

For the moment I think we would be safe to change the search string to "cannot connect" (or possibly "errno").
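As a quick check of the candidate search strings, here is a minimal sketch that tests each one against the two stderr samples quoted above. The variable and function names are made up for illustration and are not taken from the Cylc source; the match is a plain substring test, which is an assumption about how the search is done.

```python
# The two stderr samples quoted above (server name masked as in the report).
OLD_STDERR = (
    "Connection refused\n"
    "qstat: cannot connect to server xxxxxx (errno=111)\n"
)
NEW_STDERR = (
    "Connection timed out\n"
    "qstat: cannot connect to server xxxxxx (errno=xxxxx)\n"
)

# Candidate search strings: the current one, then the two proposed here.
CANDIDATES = ("Connection refused", "cannot connect", "errno")


def matches(search, stderr):
    """Return True if the search string occurs anywhere in stderr."""
    return search in stderr


# Map each candidate to (matches old sample, matches new sample).
results = {
    search: (matches(search, OLD_STDERR), matches(search, NEW_STDERR))
    for search in CANDIDATES
}

for search, (old_hit, new_hit) in results.items():
    print(f"{search!r}: old={old_hit} new={new_hit}")
```

Only "Connection refused" misses the newer message. "errno" matches both samples but is generic enough that it could plausibly appear in unrelated qstat errors, which is a reason to prefer the more specific string.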

Longer term, we should consider other ways to make the polling more robust; see #3436.

@dpmatthews dpmatthews added the bug Something is wrong :( label Dec 19, 2024
@dpmatthews dpmatthews added this to the 8.4.1 milestone Dec 19, 2024
@dpmatthews
Contributor Author

I've found I can trigger various PBS errors by setting the PBS_SERVER env variable:

PBS 2022.1.7:

$ PBS_SERVER=localhost qstat
Connection refused
qstat: cannot connect to server localhost (errno=15010)

$ PBS_SERVER=dummy qstat
Unknown Host.
qstat: cannot connect to server dummy (errno=15008)

$ PBS_SERVER=google.com qstat
Connection timed out
qstat: cannot connect to server google.com (errno=15010)

PBS 18.2.6:

$ PBS_SERVER=localhost qstat
Connection refused
qstat: cannot connect to server localhost (errno=111)

$ PBS_SERVER=dummy qstat
Unknown Host.
qstat: cannot connect to server dummy (errno=15008)

$ PBS_SERVER=google.com qstat
Connection timed out
qstat: cannot connect to server google.com (errno=110)

I think we should be safe to change the search string to "cannot connect to server".
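To illustrate how that string might feed into the poll logic, here is a rough sketch. It assumes the poller already knows whether qstat listed the job and has the stderr text to hand; `CANT_CONNECT`, `PollOutcome` and `classify_poll` are hypothetical names, not Cylc's actual API. The point is only that a failure to reach the server should leave the job state unknown rather than count as the job having gone.

```python
from enum import Enum

# Search string suggested above. The constant name is hypothetical.
CANT_CONNECT = "cannot connect to server"

# stderr captured in this comment for PBS 2022.1.7 and PBS 18.2.6.
SAMPLES = [
    "Connection refused\n"
    "qstat: cannot connect to server localhost (errno=15010)\n",
    "Unknown Host.\n"
    "qstat: cannot connect to server dummy (errno=15008)\n",
    "Connection timed out\n"
    "qstat: cannot connect to server google.com (errno=15010)\n",
    "Connection refused\n"
    "qstat: cannot connect to server localhost (errno=111)\n",
    "Connection timed out\n"
    "qstat: cannot connect to server google.com (errno=110)\n",
]


class PollOutcome(Enum):
    LISTED = "job still listed by qstat"
    NOT_LISTED = "job no longer listed by qstat"
    UNKNOWN = "qstat could not reach the server"


def classify_poll(job_listed, stderr):
    """Classify one qstat poll (hypothetical helper).

    A connection failure takes priority: if qstat never reached the
    server, the absence of the job from its output says nothing.
    """
    if CANT_CONNECT in stderr:
        return PollOutcome.UNKNOWN
    return PollOutcome.LISTED if job_listed else PollOutcome.NOT_LISTED


# Every captured error should leave the job state unknown.
outcomes = [classify_poll(False, stderr) for stderr in SAMPLES]
print(all(outcome is PollOutcome.UNKNOWN for outcome in outcomes))
```

All of the messages captured above, from both PBS versions, contain "cannot connect to server", so each of them would be treated as a communication failure instead of a failed job.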
