JSON-RPC Crashes with 2.11 #7532
Comments
I was not able to run Icinga 2 under GDB or attach to it:
After disabling the notification feature and some minutes running:
I have the exact same issue. Today I upgraded from 2.10 to 2.11 and the service started to crash. I also noticed that memory usage is way higher than before.
@bunghi please share logs and details; "exactly" the same won't help us...
The Icinga2 environment looks like this:
This morning we upgraded icinga2 on all 15 servers (masters + zone endpoints). Since then, both endpoints in one of the zones crashed (that zone has a lot of hosts). Crash reports:
Since the upgrade this morning it has crashed 5 times with an Out of Memory kernel error. Memory usage after the upgrade:
Memory and CPU usage are expected to rise with the introduction of user-land threads via Boost Coroutines; that is separate from this issue. Also, the JSON error between the main and spawn helper processes is a new issue; please move it into a dedicated issue, as requested in https://github.com/Icinga/icinga2/issues/7531#issuecomment-534547311
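To illustrate why the footprint grows, here is a minimal sketch (an illustration only, not Icinga 2's actual code) of Boost.Asio's stackful coroutines: every spawned coroutine owns its own user-land stack while it is alive, so many concurrent cluster connections raise baseline memory usage compared to a plain thread pool.

```cpp
// Minimal sketch, not Icinga 2 code: each boost::asio::spawn() creates a
// stackful coroutine with its own user-land stack, so many concurrent
// connections increase memory usage even while most coroutines are suspended.
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <chrono>
#include <iostream>

int main() {
    boost::asio::io_context io;

    for (int i = 0; i < 1000; ++i) {
        // The coroutine suspends on the timer instead of blocking an OS
        // thread, but it still holds its own coroutine stack while suspended.
        boost::asio::spawn(io, [&io](boost::asio::yield_context yield) {
            boost::asio::steady_timer timer(io, std::chrono::milliseconds(100));
            timer.async_wait(yield);
        });
    }

    io.run();
    std::cout << "1000 coroutines completed\n";
    return 0;
}
```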
@dnsmichi It's not because of the dependencies, otherwise this config would crash:
It could be related to JsonEncode, seen in various other places as well, and possibly to how memory is allocated and later freed when encoding dictionaries.
Hi, today it crashed again after a while. Maybe this output helps:
@lippserd @Al2Klimov My suspicion is that this is related to the JSON library's encode/decode, likewise object serialization, and a possible leak in there. I haven't run Valgrind yet, but that would be the next thing to try.
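To make the suspicion concrete, here is a deliberately simplified and purely hypothetical C++ sketch; EncodeDictionary and SendClusterMessage are invented names, not Icinga 2 APIs. It shows the kind of encode/free mismatch that Valgrind's --leak-check=full would report as "definitely lost" blocks growing with every cluster message.

```cpp
// Hypothetical sketch, not Icinga 2 code: an encoder returns a heap buffer
// that the caller owns, and the send path forgets to release it, so one
// buffer leaks per message sent.
#include <cstring>
#include <string>

// Invented helper: serializes a dictionary-like value into a buffer the
// caller must delete[].
char* EncodeDictionary(const std::string& json) {
    char* buffer = new char[json.size() + 1];
    std::memcpy(buffer, json.c_str(), json.size() + 1);
    return buffer;
}

void SendClusterMessage(const std::string& json) {
    char* encoded = EncodeDictionary(json);
    // ... the buffer would be written to the peer socket here ...
    // Missing `delete[] encoded;` leaks one allocation per message.
    (void)encoded;
}

int main() {
    for (int i = 0; i < 100000; ++i) {
        SendClusterMessage("{\"method\":\"event::Heartbeat\",\"params\":{}}");
    }
    return 0;
}
```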
ref/NC/644339
ref/NC/644553
I am experiencing exactly the same crash output as @bunghi posted above. I have a 2-node master cluster and around 200 agent instances. Prior to the upgrade to 2.11 the masters were stable; now both nodes crash several times a day.
I've installed gdb on the masters to see if it provides any useful details. Mark
ref/NC/647127
Yesterday (Nov 14th 2019) we experienced the same crash as @bunghi mentioned. Both masters run 2.11.2-1.xenial.
I'll take care of that.
e930efd has purred like a cat for two days. Green light for v2.12rc1.
I have a similar crash with a large number of endpoints that have been removed from the config but are still attempting to connect. There are about 400 endpoints hammering away, and icinga2 can only stay up for around 20 hours until I get this error:
Will the latest snapshot discussed here fix this, or should I lodge a separate bug report? This is a sample of the debug log:
These messages happen over and over; the debug log gets very large quickly. There are about 10,000 hosts behind the collective endpoints, by the way. Not sure that makes a difference, but some of those endpoints would have a ton of updates not sent.
@davekempe Please try v2.11.3 once it has been released (later today).
Could all of you please test v2.11.3 and tell us whether it has fixed your particular problem?
Hey, sorry, I was going to get back but the bug was closed. Happy to report the issue is fixed. I was able to simulate the problem reliably, as it happened every 24 hours in our environment if we removed the endpoints via automation. After the update it has been fine, with no crashes.
Unfortunately, we still have the problem.
@hardoverflow ... and it still crashes?
@Al2Klimov @N-o-X Was the master branch (snapshot packages) affected by this bug? If so, the master branch is currently not fixed, since the fixing changes were merged directly into the 2.11 release branch. The master branch should be tested prior to a 2.12 release to ensure the bug is fixed there too.
... and tested successfully.
@Al2Klimov Once the cluster is running, it keeps running. The error occurs sporadically when deploying: after a while, the ConfigMaster crashed. The second master is still running.
We also observe the following network bandwidth for masters and satellites.
Please share the output of:
@hardoverflow Could you please upload core dumps here?
@lippserd Done. Can you confirm?
Confirmed. All of you: if you give us core dumps, please gzip them, and if you request them, ask for them gzipped. Not all of us necessarily have enterprise downlinks due to COVID-19.
This includes the following fixes:
nlohmann/json#1436 > For a deeply-nested JSON object, the recursive implementation of the json_value::destroy function causes stack overflow.
nlohmann/json#1708
nlohmann/json#1722
Stack size: nlohmann/json#1693 (comment)
Integer overflow: nlohmann/json#1447
UTF-8, json dump out of bounds: nlohmann/json#1445
Possibly influences #7532
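As a rough illustration of the first fix listed above (a sketch assuming the bundled nlohmann/json library; the nesting depth of 100000 is an arbitrary number chosen only to make the point): building a deeply nested document is cheap, but destroying it with the old recursive json_value::destroy used one stack frame per nesting level and could overflow the stack.

```cpp
// Sketch of the failure mode behind nlohmann/json#1436: a deeply nested JSON
// value whose destruction recursed once per nesting level in older versions
// of the library and could therefore overflow the stack.
#include <nlohmann/json.hpp>
#include <iostream>

int main() {
    nlohmann::json doc;
    nlohmann::json* current = &doc;

    // Build an object nested 100000 levels deep (arbitrary illustrative depth).
    for (int depth = 0; depth < 100000; ++depth) {
        (*current)["nested"] = nlohmann::json::object();
        current = &(*current)["nested"];
    }

    std::cout << "built a deeply nested document\n";
    // With the pre-fix recursive destroy, the destructor running at the end
    // of main() recursed once per level; the fixed library avoids that.
    return 0;
}
```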
Task List
Mitigations
#7532 (comment)
Analysis
Related issues
#7569
#7687
#7624
#7470
ref/NC/636691
ref/NC/644339
ref/NC/644553
ref/NC/647127
ref/NC/652035
ref/NC/652071
ref/NC/652087
Original Report
The setup is a dual-master system which was upgraded to 2.11 around noon yesterday.
In the late evening, crashes started to appear and are now consistent. The system ran on 2.11-rc1 before.
The user started upgrading agents to 2.11; this may be related.
ref/NC/636691
Latest crash
Alternative crashes