gearmand hanging and non responsive after OMD 3.30 upgrade #107
Comments
The place to discuss this is fine. I see this on one box occasionally as well but haven't found time to dig deeper into it.
We've tried to reproduce this in a lab by copying the naemon check configuration and all the plugins onto a test OMD 3.30 install. The problem has not reoccurred in the lab, so we're now wondering if it's something specific to the output of one of our checks or similar. So we've built another OMD 3.30, this time in the same environment as the ones we upgraded previously, and it's running all the same checks in parallel to OMD 3.10. The only difference in configuration between the two is that Keepalived and Postfix are both disabled on the OMD 3.30; and the
This has just been reproduced now, so presumably the specific checks we're running are important.
That's strange, because mod-gearman encrypts the payload and base64-encodes the encrypted result to avoid such problems. Except for the
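The base64 step mentioned above can be illustrated with a minimal sketch. It does not reproduce mod-gearman's actual encryption; a fixed byte string stands in for the encrypted payload (an assumption made for illustration only), and the point is simply that the encoded form contains no NUL, newline, or other control bytes.

```python
import base64

# Stand-in for an encrypted check result (NOT real mod-gearman output):
# arbitrary bytes including NUL, LF, CR and a pipe character.
ciphertext = bytes([0x00, 0x0A, 0x0D, 0xFF, 0x7C]) + b"output with | pipes\n"

encoded = base64.b64encode(ciphertext)

# Standard base64 only emits A-Z, a-z, 0-9, '+', '/' and '=' padding.
safe_alphabet = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
)
is_safe = all(byte in safe_alphabet for byte in encoded)

# Decoding restores the original bytes exactly.
round_trips = base64.b64decode(encoded) == ciphertext
```

If the check output itself were the trigger, it would therefore have to act through something other than raw bytes reaching gearmand.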
@sni we use gcore to get the coredumps; we think we need a debug build of gearmand with symbols to feed into
Yeah, you can simply run … Seems that gearmand's internal state gets confused somehow.
We managed to catch and attach to this happening, and it's basically in an infinite loop between:
Great debugging, I guess it's time to get the gearmand people involved.
We've opened an issue here: gearman/gearmand#301. We think that's the right place for it.
Are you using pnp in gearman mode?
Yes
OK, then it might be the same issue.
@sni We patched our test OMD 3.30 box on the 28th and have had it running since. It doesn't seem to have locked up with
Same here, no outage since then. Looks good.
Still no issues here since then with the latest nightly.
Interesting, we're testing with the nightly rather than manually applying dd8b990 to see if we get it again.
@sni Just checking through the mod_gearman code for anything looking "weird"; noticed this: https://github.com/sni/mod_gearman/blob/master/common/gearman_utils.c#L97. Is that definitely the network timeout for the connection to gearmand and not a job timeout (i.e.
Nightly has been running for a few hours now and we've not seen the problem yet.
We're going to upgrade our secondary at our first site to the nightly and see if the stability remains. It's looking pretty good so far, however!
Looking stable so far. There does seem to be a pause in the behaviour of checks when the
Just reporting in, this has been stable for another week; I'm pretty convinced this has worked around the gearmand problem now :)
That's great to hear. Will close this then.
this patch addresses some issues in gearman worker mode:
- when forking a child, the forked %children contains all children created so far, so if that child receives a sigint, it will kill its siblings.
- check for exited children in a loop in case multiple children exited at once
- do not remove the pidfile when a child receives a sigint
- fix issue with gearman jobs having a timeout, for details see
  - gearman/gearmand#301
  - ConSol-Monitoring/omd#107
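The first two points of the patch description can be sketched as follows. This is an illustrative Python analogue, not the patch itself (the patch refers to a Perl hash, %children); the names children, spawn_child and reap_children are hypothetical.

```python
import os
import time

# Hypothetical stand-in for the patch's %children: pid -> label.
children = {}

def spawn_child(work):
    """Fork a worker; the child forgets the siblings it inherited."""
    pid = os.fork()
    if pid == 0:
        # In the child: the inherited table lists every sibling created so
        # far. Clearing it means signal handling here cannot kill them.
        children.clear()
        try:
            work()
        finally:
            os._exit(0)
    children[pid] = "worker"
    return pid

def reap_children():
    """Collect every child that has already exited, not just the first."""
    reaped = []
    while children:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no child processes exist any more
        if pid == 0:
            break  # remaining children are still running
        children.pop(pid, None)
        reaped.append(pid)
    return reaped
```

Looping on a non-blocking waitpid matters because several children can exit before the parent handles a single SIGCHLD; reaping only one per notification would leave the rest as zombies and the bookkeeping out of date.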
After upgrading to OMD 3.30 from 3.10 we find that gearmand hangs and can no longer be connected to, meaning all the checks time out. We use mod_gearman to provide dupserver, which we then use to submit the results to our secondary OMD box. omd restart gearmand fails to kill the process and the only solution is to kill -9 gearmand.
We have core files of gearmand when it's in this state and ran strace against it, which showed no activity occurring in the process.
The error from running gearman_top when it's hung is:
which is very unusual; if gearmand is stopped or unavailable, connection refused is the error.
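The distinction drawn here, between a refused connection and a daemon that is running but not answering, can be checked with a small probe. This is a minimal sketch under stated assumptions: gearmand listening on its default port 4730 and answering the text admin command status. The function name and return labels are hypothetical, and the sketch does not reproduce the gearman_top error.

```python
import socket

def probe_gearmand(host="127.0.0.1", port=4730, timeout=2.0):
    """Return 'refused', 'timeout', 'closed' or 'ok' for a gearmand-like server."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"  # nothing is listening: daemon stopped
    except socket.timeout:
        return "timeout"  # connect itself never completed
    with sock:
        sock.settimeout(timeout)
        try:
            sock.sendall(b"status\n")  # text admin command
            data = sock.recv(4096)
        except socket.timeout:
            return "timeout"  # connected, but no reply arrived in time
        except ConnectionError:
            return "closed"   # peer reset the connection
        return "ok" if data else "closed"
```

A result of "timeout" while the process is still present would match the hang described in this issue, whereas "refused" is what a stopped daemon produces.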
We think this is probably related to the upgrade from gearmand 0.33, which we think occurred in OMD 3.20. Where is the best place to discuss the issue? Is here fine?