Should crash if configured collectors are missing #57

paalbra · 2023-05-23T08:30:46Z

If you have a configured collector, but the collector is missing, zac will just continue:

zabbix-auto-config/zabbix_auto_config/__init__.py

Lines 29 to 33 in 2b45f1c

    
           try: 
        
               module = importlib.import_module(source_collector_values.module_name) 
        
           except ModuleNotFoundError: 
        
               logging.error("Unable to find source collector named '%s' in '%s'", source_collector_values.module_name, source_collector_dir) 
        
               continue

This might lead to unexpected behavior and bad results. It's probably better to raise a ZACException and crash? Thoughts?

The text was updated successfully, but these errors were encountered:

pederhan · 2023-05-24T12:59:48Z

We don't have the manpower to babysit bad source collectors and host modifiers, so failing straight up would not be ideal for us. If we are provided with a bad source collector, it shouldn't interfere with the good ones.

We discussed this issue today, and want to implement a system wherein source collectors (and modifiers?) are disabled after failing X times in Y seconds. This allows us to integrate potentially unstable source collectors without jeopardizing the overall stability of ZAC.

Some source collectors might be more unstable than others; would it make sense for an overall failure tolerance setting, which can be optionally overridden by individual source collectors? Something like this for global configuration:

[zac.source_collectors]
failure_tolerance = 5
failure_interval = 300

And this for individual source collector configuration:

[source_collectors.mysource]
module_name = "mysource"
update_interval = 60
failure_tolerance = 10
failure_interval = 180

Alternatively:

[source_collectors.mysource]
module_name = "mysource"
update_interval = 60

[source_collectors.mysource.failure]
tolerance = 10
interval = 180

These are the names suggested by ChatGPT (I struggled to come up with names myself):

Alternative names for X (threshold for number of tolerated failures):
- failure_tolerance
- failure_threshold
- failure_limit
- failure_count
- failure_maximum
Alternative names for Y (duration in which failures are measured):
- failure_duration
- failure_window
- failure_measurement_period
- failure_timeframe
- failure_interval

Finally, to support your proposed behavior there could be a configuration option that bypasses this error tolerance strategy entirely (struggling to come up with a good name for this):

[zac.source_collectors]
exit_on_failure = true # immediately exit upon encountering a failing source collector

paalbra · 2023-05-24T14:44:36Z

Now we're not talking about the same issue. Missing collector != failing collector (or failing modifier).

Missing collector, this issue

A missing collector means that all hosts collected by the collector should be deleted from zabbix (might not really be true as long as this issue persists #11 ). This is quite severe. I think ZAC should crash rather than continue, as stated in this issue (even though the failsafe probably will trigger. There is no need to rely on the safety net when you know you're failing).

Failing collector

This isn't a problem? There is nothing to fix?

It will raise an exception:

zabbix-auto-config/zabbix_auto_config/processing.py

Line 106 in 2b45f1c

    
           raise exceptions.ZACException(f"Unable to collect from module ({self.config['module_name']}): {str(e)}")

Log and go into a bad state (which can be monitored):

zabbix-auto-config/zabbix_auto_config/processing.py

Lines 54 to 59 in 2b45f1c

    
           try: 
        
               self.work() 
        
               self.state["ok"] = True 
        
           except exceptions.ZACException as e: 
        
               logging.error("Work exception: %s", str(e)) 
        
               self.state["ok"] = False

This means that the current collect run is discarded and ZAC will retry with the normal interval. If the next run is ok the state will also become ok.

(I'm pretty sure this happens from now and then at UiO, with e.g. Nivlheim being unavailable, if I remember correct?)

Failing modifier

Then a host is just not modified:

zabbix-auto-config/zabbix_auto_config/processing.py

Lines 309 to 311 in 2b45f1c

    
           except Exception as e: 
        
               logging.warning("Error when running modifier %s on host '%s': %s", host_modifier["name"], hostname, str(e)) 
        
               # TODO: Do more?

This could lead to a weird state/unknown problems (very dependent on the modifier, obviously). You can tell I was wondering what to do from the comment. I now think that the current host that is being modified should probably be discarded (kind of like a failing collector), but this is really another issue than what I'm mentioning here.

pederhan · 2023-05-25T07:48:38Z

Ok, my bad, I can create a separate issue. But don't you think there should consistent behavior both in terms of loading and running external modules?

Loading a modifier:

zabbix-auto-config/zabbix_auto_config/processing.py

Lines 236 to 243 in 2b45f1c

    
           for module_name in module_names: 
        
               module = importlib.import_module(module_name) 
        
               try: 
        
                   assert callable(module.modify) 
        
               except (AttributeError, AssertionError): 
        
                   logging.warning("Host modifier is missing 'modify' callable. Skipping: '%s'", module_name) 
        
                   continue

Loading a collector:

zabbix-auto-config/zabbix_auto_config/__init__.py

Lines 28 to 33 in 2b45f1c

    
           for source_collector_name, source_collector_values in config.source_collectors.items(): 
        
               try: 
        
                   module = importlib.import_module(source_collector_values.module_name) 
        
               except ModuleNotFoundError: 
        
                   logging.error("Unable to find source collector named '%s' in '%s'", source_collector_values.module_name, source_collector_dir) 
        
                   continue

In the case of collectors, we at least check if the import was successful or not. Even if we're fairly sure a module exists when loading it (in the case of host modifiers), it might contain invalid syntax or otherwise be broken. So some sort of unified interface for loading modules would be nice:

import importlib
from types import ModuleType


def load_module(name: str) -> ModuleType:
    try:
        return importlib.import_module(name)
    except Exception as e:
        if isinstance(e, ModuleNotFoundError):
            msg = f"Unable to find module named {name!r}"
        else:
            msg = f"Unable to import module named {name!r}: {str(e)}"
        logger.error(msg)
        raise ZACException(msg) from e

And then based on the config, callers of this function could decide whether or not a ZACException should be fatal.

paalbra · 2023-05-25T08:22:22Z

I already created #58 .

I think collectors and modifiers currently behave differently and probably can't really be looked at like some general external module.

One is explicitly stated in the config. The other is "just there".

I'm happy to hear suggestions, but I still think this issue is valid: zac shouldn't start if collectors (explicitly stated in the config) are missing.

Isn't that last code bit just kind of wrapping the exception? Does it really add anything?

pederhan · 2023-05-25T08:54:55Z

Isn't that last code bit just kind of wrapping the exception? Does it really add anything?

Not by itself no, but callers of this function would know that they only need to handle ZACException, which is more predictable than all the possible error states from importlib.import_module. Furthermore, it would deduplicate the import code used for collectors and modified.

So instead of

zabbix-auto-config/zabbix_auto_config/__init__.py

Lines 23 to 33 in 2b45f1c

    
           def get_source_collectors(config): 
        
               source_collector_dir = config.zac.source_collector_dir 
        
               sys.path.append(source_collector_dir) 
        
               source_collectors = [] 
        
               for source_collector_name, source_collector_values in config.source_collectors.items(): 
        
                   try: 
        
                       module = importlib.import_module(source_collector_values.module_name) 
        
                   except ModuleNotFoundError: 
        
                       logging.error("Unable to find source collector named '%s' in '%s'", source_collector_values.module_name, source_collector_dir) 
        
                       continue

we would do:

def get_source_collectors(config: models.Settings) -> List[SourceCollectorDict]:
    source_collector_dir = config.zac.source_collector_dir
    sys.path.append(source_collector_dir)

    source_collectors = []  # type: List[SourceCollectorDict]
    for (source_collector_name, source_collector_values) in config.source_collectors.items():
        try:
            load_module(source_collector_values.module_name)
        except ZACException as e:
            if config.zac.source_collectors.ignore_missing:
                logging.warning(
                    "Source collector '%s' import failed: '%s'",
                    source_collector_values.module_name,
                    str(e),
                )
                continue
            raise
       # exceptions not handled by `load_module` will always fail
    # ...

So yes, it "only" wraps the error, and we could do a bare try...except Exception:, but this way we always log when an import fails in load_module, but the caller decides how to handle the missing import. The same load_module function should of course be re-used when importing host modifiers.

Anyway, I don't feel too strongly about this, but I think fundamentally importing modifiers and collectors is the same thing; we are doing dynamic imports on runtime, and there should be predictable behavior shared between the two.

paalbra · 2023-05-25T09:29:55Z

I have no strong opinion when it comes to wrapping in a load_module. It could be useful.

My only strong opinion in this issue is that zac should fail rather than continue when a collector is missing. It's a severe error.

There is a big difference here between collectors and modifiers since only collectors can be missing. Zac doesn't know what modifiers to expect. Maybe this should change in the future. Maybe zac should have configured a list of enabled modifiers. I think that's not a bad idea.

paalbra mentioned this issue May 25, 2023

If a modifier fails the host should be discarded #58

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should crash if configured collectors are missing #57

Should crash if configured collectors are missing #57

paalbra commented May 23, 2023

pederhan commented May 24, 2023 •

edited

Loading

paalbra commented May 24, 2023

pederhan commented May 25, 2023

paalbra commented May 25, 2023

pederhan commented May 25, 2023 •

edited

Loading

paalbra commented May 25, 2023

Should crash if configured collectors are missing #57

Should crash if configured collectors are missing #57

Comments

paalbra commented May 23, 2023

pederhan commented May 24, 2023 • edited Loading

paalbra commented May 24, 2023

Missing collector, this issue

Failing collector

Failing modifier

pederhan commented May 25, 2023

paalbra commented May 25, 2023

pederhan commented May 25, 2023 • edited Loading

paalbra commented May 25, 2023

pederhan commented May 24, 2023 •

edited

Loading

pederhan commented May 25, 2023 •

edited

Loading