gearmand hanging and non responsive after OMD 3.30 upgrade #107

Closed
infraweavers opened this issue Aug 11, 2020 · 27 comments

Comments

@infraweavers
Contributor

After upgrading from OMD 3.10 to OMD 3.30, we find that gearmand hangs and can no longer be connected to, meaning all the checks time out. We use mod_gearman with a dupserver configured, which submits the results to our secondary OMD box.

omd restart gearmand fails to kill the process and the only solution is to kill -9 gearmand.

[screenshot: cantstop]

We have core files of gearmand from when it's in this state, and we ran strace against it, which showed no activity in the process.

[screenshot: strace_gdbattached]

The error from running gearman_top when it's hung is:

failed to connect to address localhost and port 4730: Interrupted system call.

[screenshot: Gearman_top]

This is very unusual; when gearmand is stopped or unavailable, the error is "connection refused" instead.
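The two outcomes contrasted here can be told apart from a script. The following is a minimal sketch and not part of gearman_top or OMD; the function name `probe` and the labels it returns are made up for illustration. It only classifies the result of a TCP connect to the gearmand port (4730 in the error above).

```python
import errno
import socket


def probe(host, port, timeout=2.0):
    """Classify the outcome of a TCP connect attempt (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is listening: the usual result when the daemon is stopped.
        return "refused"
    except socket.timeout:
        # The connect did not complete within `timeout` seconds.
        return "timeout"
    except OSError as exc:
        # Any other failure is reported by its errno name, e.g. "EINTR".
        return "error: " + errno.errorcode.get(exc.errno, str(exc))


if __name__ == "__main__":
    print(probe("localhost", 4730))
```

Note that "open" alone does not prove the daemon is responsive, because the kernel can complete the TCP handshake on behalf of a process that never calls accept. "Interrupted system call" is the standard error text for errno EINTR.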

We think this is probably related to the upgrade from gearmand 0.33, which we think occurred in OMD 3.20. Where is the best place to discuss the issue? Is here fine?

@sni
Contributor

sni commented Aug 11, 2020

This place is fine for discussing it. I see this on one box occasionally as well but haven't found time to dig deeper into it.

@infraweavers
Contributor Author

We've tried to reproduce this in a lab by copying the naemon check configuration and all the plugins onto a test OMD 3.30 install. The problem has not reoccurred in the lab, so we're now wondering if it's something specific to the output of one of our checks or similar. So we've built another OMD 3.30, this time in the same environment as the ones we upgraded previously, and it's running all the same checks in parallel with OMD 3.10. The only differences in configuration between the two are that Keepalived and Postfix are both disabled on the OMD 3.30 box, and the dupserver is commented out on OMD 3.30 at the moment. If we can't reproduce the problem in a few hours, we'll create another OMD 3.30 and configure the dupserver so it is identical to the upgraded pair we had the issue on above.

@infraweavers
Contributor Author

This has just been reproduced now, so presumably the specific checks we're running are important.

@sni
Contributor

sni commented Aug 12, 2020

That's strange, because mod-gearman encrypts the payload and encodes the encrypted data with base64 to avoid such problems, except for the unique identifier. So it might have something to do with a specific host or service name.

@infraweavers
Contributor Author

infraweavers commented Aug 12, 2020

We've also just had this scenario (-1 jobs) appear whilst trying to narrow it down:

[screenshot]

Looks like gearadmin --status wraps the integer:
image
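The screenshots are not preserved here, so the exact numbers are unknown. As an illustration only, and assuming the job counter is a 32-bit unsigned integer (an assumption, not something confirmed in this thread), decrementing it below zero would show up as -1 when read as signed and as a very large number when read as unsigned:

```python
import struct


def as_unsigned32(value):
    """Reduce an integer to the range of a 32-bit unsigned counter."""
    return value & 0xFFFFFFFF


def as_signed32(value):
    """Reinterpret the same 32 bits as a signed integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


# Hypothetical counter: one more decrement than there were increments.
counter = as_unsigned32(0 - 1)
print(counter, as_signed32(counter))
```

Either rendering would point at the same underlying event: a job count decremented once more than it was incremented.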

@infraweavers
Contributor Author

@sni we use gcore to get the core dumps. We think we need a debug build of gearmand with symbols to feed into gdb. Do you know how we can build one or get the required files?

@infraweavers
Contributor Author

We think we got this working; it looks like just building gearmand from OMD 3.30's release commit was enough:

[screenshot]

@sni
Contributor

sni commented Aug 13, 2020

Yeah, you can simply run make from the packages/gearmand folder. You may need to run ./configure from the base folder first to install all build dependencies. And you might want to adjust some CFLAGS to increase debug symbols if necessary.

It seems that gearmand's internal state gets confused somehow.

@infraweavers
Contributor Author

infraweavers commented Aug 13, 2020

We just reproduced it again; this time, however, we were tailing the gearmand log:
[screenshot]

We saw gearman_top show -1 for check_results, which correlated with the log entry appearing; shortly afterwards gearmand stopped responding completely.

@infraweavers
Contributor Author

infraweavers commented Aug 14, 2020

We managed to catch this happening and attach to the process; it's basically stuck in an infinite loop in these lines:
https://github.com/gearman/gearmand/blob/master/libgearman-server/worker.cc#L124-L131

@sni
Contributor

sni commented Aug 14, 2020

Great debugging. I guess it's time to get the gearmand people involved.

@infraweavers
Contributor Author

infraweavers commented Aug 14, 2020

We've opened an issue here: gearman/gearmand#301. We think that's the right place for it.

@infraweavers
Contributor Author

@sni We do use pnp_gearman_worker for our graphing on the affected environment; we'll make this change (dd8b990) on our test OMD 3.30 box and see if it makes any difference to it.

@sni
Contributor

sni commented Aug 28, 2020

are you using pnp in gearman mode?

@infraweavers
Contributor Author

Yes

@sni
Contributor

sni commented Aug 28, 2020

ok, then it might be the same issue.

@infraweavers
Contributor Author

@sni We patched our test OMD 3.30 box on the 28th and have had it running since then. It doesn't seem to have locked up with the "Interrupted system call" error in the few days we've been running it, whereas before it would almost certainly have done so at least once by now. So it seems like dd8b990 may have worked around the gearmand bug in our environment. We'll keep testing it for a bit and see if anything changes.

@sni
Contributor

sni commented Sep 2, 2020

same here, no outage since then. Looks good.

@infraweavers
Contributor Author

@sni We just tried promoting one of our actual production boxes to 3.30 and applied dd8b990. Unfortunately it just broke again with exactly the same behaviour as before, so it doesn't look like it's fixed for us :(

@sni
Contributor

sni commented Sep 9, 2020

Still no issues here since then with the latest nightly.

@infraweavers
Contributor Author

Interesting. We're now testing with the nightly rather than manually applying dd8b990, to see if we get it again.

@infraweavers
Contributor Author

@sni Just checking through the mod_gearman code for anything looking "weird"; noticed this: https://github.com/sni/mod_gearman/blob/master/common/gearman_utils.c#L97. Is that definitely the network timeout for the connection to gearmand and not a job timeout (i.e. CAN_DO_TIMEOUT in gearman parlance)?
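To make the distinction in this question concrete: in the Gearman binary protocol, a worker announces a per-job timeout to the server by registering a function with CAN_DO_TIMEOUT (packet type 23) instead of CAN_DO (packet type 1), whereas a network timeout is purely local to the client socket and never appears on the wire. The sketch below only builds the two request packets; the helper names are made up for illustration, and it says nothing about what the linked mod_gearman line actually does.

```python
import struct

REQ_MAGIC = b"\0REQ"   # magic bytes that open every request packet
CAN_DO = 1             # register a function without a job timeout
CAN_DO_TIMEOUT = 23    # register a function with a job timeout in seconds


def build_request(packet_type, *args):
    """Frame a request: magic, type and size (big-endian), NUL-joined args."""
    data = b"\0".join(args)
    return REQ_MAGIC + struct.pack(">II", packet_type, len(data)) + data


def can_do(function):
    """Hypothetical helper: plain registration, no timeout sent to the server."""
    return build_request(CAN_DO, function.encode())


def can_do_timeout(function, seconds):
    """Hypothetical helper: registration that asks the server to time jobs out."""
    return build_request(CAN_DO_TIMEOUT, function.encode(), str(seconds).encode())


if __name__ == "__main__":
    print(can_do("check_results"))
    print(can_do_timeout("check_results", 30))
```

Only the second packet makes the server track a deadline for jobs handed to that worker, which is the case the later pnp4nagios commit message refers to as "gearman jobs having a timeout".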

@infraweavers
Contributor Author

Nightly has been running for a few hours now and we've not seen the problem yet

@infraweavers
Contributor Author

We're going to upgrade our secondary at our first site to the nightly and see if the stability remains. It's looking pretty good so far, however!

@infraweavers
Contributor Author

Looking stable so far, there does seem to be a pause in the behaviour of checks when the dupserver is unavailable; however that is a totally separate issue from this. The good news is, it does seem to be fixed so far...

@infraweavers
Contributor Author

Just reporting in, this has been stable for another week; I'm pretty convinced this has worked around the gearmand problem now :)

@sni
Contributor

sni commented Sep 28, 2020

That's great to hear. Will close this then.

@sni sni closed this as completed Sep 28, 2020
sni added a commit to sni/pnp4nagios that referenced this issue May 20, 2021
this patch addresses some issues in gearman worker mode:

    - when forking a child, the forked %children contains all children created so far, so if that child receives a SIGINT, it will kill its siblings.
    - check for exited children in a loop in case multiple children exited at once
    - do not remove the pidfile when a child receives a SIGINT
    - fix issue with gearman jobs having a timeout, for details see
        - gearman/gearmand#301
        - ConSol-Monitoring/omd#107
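The first two points of that commit message describe a general pattern for forking workers. The following is a minimal Python sketch of that pattern, not the pnp4nagios code (which is Perl); the function names are made up for illustration. Each child empties the table of sibling PIDs it inherited across the fork, and the parent reaps in a loop so that several children exiting at once are all collected.

```python
import os


def spawn_workers(count):
    """Fork `count` children and return {pid: index} in the parent."""
    children = {}
    for index in range(count):
        pid = os.fork()
        if pid == 0:
            # Child: drop the inherited table of siblings so that a signal
            # handler in this process can never signal them by mistake.
            children.clear()
            os._exit(0)  # stand-in for the real worker loop
        children[pid] = index
    return children


def reap_exited(children):
    """Reap every child that has already exited, not just the first one."""
    reaped = []
    while children:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # the remaining children are still running
        children.pop(pid, None)
        reaped.append(pid)
    return reaped
```

A parent would typically call `reap_exited` from its main loop after a SIGCHLD, since one signal delivery can stand for several exited children.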