Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should crash if configured collectors are missing #57

Open
paalbra opened this issue May 23, 2023 · 6 comments
Open

Should crash if configured collectors are missing #57

paalbra opened this issue May 23, 2023 · 6 comments

Comments

@paalbra
Copy link
Contributor

paalbra commented May 23, 2023

If you have a configured collector, but the collector is missing, zac will just continue:

try:
module = importlib.import_module(source_collector_values.module_name)
except ModuleNotFoundError:
logging.error("Unable to find source collector named '%s' in '%s'", source_collector_values.module_name, source_collector_dir)
continue

This might lead to unexpected behavior and bad results. It's probably better to raise a ZACException and crash? Thoughts?

@pederhan
Copy link
Member

pederhan commented May 24, 2023

We don't have the manpower to babysit bad source collectors and host modifiers, so failing straight up would not be ideal for us. If we are provided with a bad source collector, it shouldn't interfere with the good ones.

We discussed this issue today, and want to implement a system wherein source collectors (and modifiers?) are disabled after failing X times in Y seconds. This allows us to integrate potentially unstable source collectors without jeopardizing the overall stability of ZAC.

Some source collectors might be more unstable than others; would it make sense for an overall failure tolerance setting, which can be optionally overridden by individual source collectors? Something like this for global configuration:

[zac.source_collectors]
failure_tolerance = 5
failure_interval = 300

And this for individual source collector configuration:

[source_collectors.mysource]
module_name = "mysource"
update_interval = 60
failure_tolerance = 10
failure_interval = 180

Alternatively:

[source_collectors.mysource]
module_name = "mysource"
update_interval = 60

[source_collectors.mysource.failure]
tolerance = 10
interval = 180

These are the names suggested by ChatGPT (I struggled to come up with names myself):

  1. Alternative names for X (threshold for number of tolerated failures):

    • failure_tolerance
    • failure_threshold
    • failure_limit
    • failure_count
    • failure_maximum
  2. Alternative names for Y (duration in which failures are measured):

    • failure_duration
    • failure_window
    • failure_measurement_period
    • failure_timeframe
    • failure_interval

Finally, to support your proposed behavior there could be a configuration option that bypasses this error tolerance strategy entirely (struggling to come up with a good name for this):

[zac.source_collectors]
exit_on_failure = true # immediately exit upon encountering a failing source collector

@paalbra
Copy link
Contributor Author

paalbra commented May 24, 2023

Now we're not talking about the same issue. Missing collector != failing collector (or failing modifier).

Missing collector, this issue

A missing collector means that all hosts collected by the collector should be deleted from zabbix (might not really be true as long as this issue persists #11 ). This is quite severe. I think ZAC should crash rather than continue, as stated in this issue (even though the failsafe probably will trigger. There is no need to rely on the safety net when you know you're failing).

Failing collector

This isn't a problem? There is nothing to fix?

It will raise an exception:

raise exceptions.ZACException(f"Unable to collect from module ({self.config['module_name']}): {str(e)}")

Log and go into a bad state (which can be monitored):

try:
self.work()
self.state["ok"] = True
except exceptions.ZACException as e:
logging.error("Work exception: %s", str(e))
self.state["ok"] = False

This means that the current collect run is discarded and ZAC will retry with the normal interval. If the next run is ok the state will also become ok.

(I'm pretty sure this happens from now and then at UiO, with e.g. Nivlheim being unavailable, if I remember correct?)

Failing modifier

Then a host is just not modified:

except Exception as e:
logging.warning("Error when running modifier %s on host '%s': %s", host_modifier["name"], hostname, str(e))
# TODO: Do more?

This could lead to a weird state/unknown problems (very dependent on the modifier, obviously). You can tell I was wondering what to do from the comment. I now think that the current host that is being modified should probably be discarded (kind of like a failing collector), but this is really another issue than what I'm mentioning here.

@pederhan
Copy link
Member

Ok, my bad, I can create a separate issue. But don't you think there should consistent behavior both in terms of loading and running external modules?

Loading a modifier:

for module_name in module_names:
module = importlib.import_module(module_name)
try:
assert callable(module.modify)
except (AttributeError, AssertionError):
logging.warning("Host modifier is missing 'modify' callable. Skipping: '%s'", module_name)
continue

Loading a collector:

for source_collector_name, source_collector_values in config.source_collectors.items():
try:
module = importlib.import_module(source_collector_values.module_name)
except ModuleNotFoundError:
logging.error("Unable to find source collector named '%s' in '%s'", source_collector_values.module_name, source_collector_dir)
continue

In the case of collectors, we at least check if the import was successful or not. Even if we're fairly sure a module exists when loading it (in the case of host modifiers), it might contain invalid syntax or otherwise be broken. So some sort of unified interface for loading modules would be nice:

import importlib
from types import ModuleType


def load_module(name: str) -> ModuleType:
    try:
        return importlib.import_module(name)
    except Exception as e:
        if isinstance(e, ModuleNotFoundError):
            msg = f"Unable to find module named {name!r}"
        else:
            msg = f"Unable to import module named {name!r}: {str(e)}"
        logger.error(msg)
        raise ZACException(msg) from e

And then based on the config, callers of this function could decide whether or not a ZACException should be fatal.

@paalbra
Copy link
Contributor Author

paalbra commented May 25, 2023

I already created #58 .

I think collectors and modifiers currently behave differently and probably can't really be looked at like some general external module.

One is explicitly stated in the config. The other is "just there".

I'm happy to hear suggestions, but I still think this issue is valid: zac shouldn't start if collectors (explicitly stated in the config) are missing.

Isn't that last code bit just kind of wrapping the exception? Does it really add anything?

@pederhan
Copy link
Member

pederhan commented May 25, 2023

Isn't that last code bit just kind of wrapping the exception? Does it really add anything?

Not by itself no, but callers of this function would know that they only need to handle ZACException, which is more predictable than all the possible error states from importlib.import_module. Furthermore, it would deduplicate the import code used for collectors and modified.

So instead of

def get_source_collectors(config):
source_collector_dir = config.zac.source_collector_dir
sys.path.append(source_collector_dir)
source_collectors = []
for source_collector_name, source_collector_values in config.source_collectors.items():
try:
module = importlib.import_module(source_collector_values.module_name)
except ModuleNotFoundError:
logging.error("Unable to find source collector named '%s' in '%s'", source_collector_values.module_name, source_collector_dir)
continue

we would do:

def get_source_collectors(config: models.Settings) -> List[SourceCollectorDict]:
    source_collector_dir = config.zac.source_collector_dir
    sys.path.append(source_collector_dir)

    source_collectors = []  # type: List[SourceCollectorDict]
    for (source_collector_name, source_collector_values) in config.source_collectors.items():
        try:
            load_module(source_collector_values.module_name)
        except ZACException as e:
            if config.zac.source_collectors.ignore_missing:
                logging.warning(
                    "Source collector '%s' import failed: '%s'",
                    source_collector_values.module_name,
                    str(e),
                )
                continue
            raise
       # exceptions not handled by `load_module` will always fail
    # ...

So yes, it "only" wraps the error, and we could do a bare try...except Exception:, but this way we always log when an import fails in load_module, but the caller decides how to handle the missing import. The same load_module function should of course be re-used when importing host modifiers.

Anyway, I don't feel too strongly about this, but I think fundamentally importing modifiers and collectors is the same thing; we are doing dynamic imports on runtime, and there should be predictable behavior shared between the two.

@paalbra
Copy link
Contributor Author

paalbra commented May 25, 2023

I have no strong opinion when it comes to wrapping in a load_module. It could be useful.

My only strong opinion in this issue is that zac should fail rather than continue when a collector is missing. It's a severe error.

There is a big difference here between collectors and modifiers since only collectors can be missing. Zac doesn't know what modifiers to expect. Maybe this should change in the future. Maybe zac should have configured a list of enabled modifiers. I think that's not a bad idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants