
Fix 'now()' not working on label values #1

Closed

Conversation

@kir4h (Contributor) commented Jan 19, 2020

First of all, thanks for your project! While browsing for a way to generate metrics from arbitrary logs, I didn't want to go for a complex/grok-based solution, since my requirement was pretty simple. I thought about writing this functionality myself, only to find out someone had already done so :)

According to the documentation, the function now() should be valid for label values:

Supported values (for metric results and label values):

  • Static numbers, such as 1, 10.3e5
  • Regex group references, such as $pool from the regexp pool=(?P<pool>[^ ]+) msg
  • now() which returns the current Unix time
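
For illustration, here is a minimal sketch combining all three value types, based on the config format shown later in this thread (the metric names and help strings are made up; the pool regexp is the one from the docs):

logs:
- path: /logs/pool.log
  patterns:
  - match: pool=(?P<pool>[^ ]+) msg
    metric: pool_up
    type: gauge
    help: Whether a pool message was seen
    action: set
    continue: true
    value: 1            # static number
    labels:
      pool: $pool       # regex group reference
  - match: pool=(?P<pool>[^ ]+) msg
    metric: pool_last_seen
    type: gauge
    help: Unix time a pool message was last seen
    action: set
    value: now()        # current Unix time
    labels:
      pool: $pool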

This PR makes now() be evaluated for label values as well.

I have split the PR into 3 commits:

  • The fix itself
  • An amendment to the test of now() usage in metric values (please correct me if I'm wrong or misunderstood the test)
  • An update of the packages through dep ensure

Feel free to drop any commit, or let me know if any changes are required/suggested.

@kir4h changed the title from "Fix now not working on label values" to "Fix 'now()' not working on label values" on Jan 19, 2020
@kir4h requested a review from @hoffie on Jan 19, 2020 14:06
@hoffie (Owner) commented Jan 20, 2020

Thanks for the PR, looks interesting, will have a deeper look in the coming days. :)

@hoffie (Owner) commented Jan 29, 2020

I really like it code-wise, thanks for your effort!

I am a bit unsure about the semantics. Prometheus-wise, such metrics are going to be problematic, aren't they? Having a timestamp as a label might well lead to cardinality explosion.

Can you describe what the use case for this is? Based on that, we can better judge whether to merge this as-is or if there are any other options for achieving your goals.

I agree that something has to be fixed -- the docs are currently wrong. Either the feature gets implemented or the docs will have to be corrected.

Let me know what you think.

@kir4h (Contributor, Author) commented Jan 30, 2020

I am a bit unsure about the semantics. Prometheus-wise, such metrics are going to be problematic, aren't they? Having a timestamp as a label might well lead to cardinality explosion.

Indeed! I realized this some days after raising my PR, while making use of the output :)

Can you describe what the use case for this is?

Scenario

My journey began with a set of shell scripts I wanted to instrument. I started using Pushgateway for this, adding some small logic to a few of them to push their metrics. But then I realized I shouldn't "contaminate" the scripts with Prometheus awareness, and ended up thinking that "something" should derive that info from the logs and generate those metrics (that's how I ended up using this project).

When running such scripts regularly through cron jobs, a useful metric is knowing when the script was last executed successfully (so that I can generate an alert, for instance, if it doesn't happen when it's supposed to).

While using this project, I realized I had lost one of the goodies from Pushgateway: for every pushed metric group, it generates additional metrics (push_time_seconds and push_failure_time_seconds) that allow knowing when a metric was last reported.
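
(For illustration: with Pushgateway, a staleness alert on that metric typically looks roughly like the following alerting rule; the group/alert names and threshold are placeholders.)

groups:
- name: cron_freshness
  rules:
  - alert: CronJobStale
    # fires when a group has not pushed metrics for more than an hour
    expr: time() - push_time_seconds > 3600
    annotations:
      summary: 'Job {{ $labels.job }} has not pushed metrics for over an hour'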

First approach

To overcome this, I thought of using labels for my metric: whenever a regexp matched in a log file and a corresponding metric was to be issued, a label carrying the timestamp was added. This way, I could read this value in Prometheus/Grafana and do something about it.

It was my first contact with metric generation, so only later did I realize the cardinality issue: each metric, because it carried a different label value for the timestamp, was increasing cardinality, and this would keep happening every time the metric was issued. At that point I should have come back and commented on it in this PR, but it didn't cross my mind, sorry!
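
For illustration, that first approach looked roughly like this (the last_run label name is made up):

patterns:
- match: .*total time execution:\s*(?P<seconds>\d+)\s*seconds
  metric: duration_seconds
  type: gauge
  help: Execution time in seconds
  action: set
  value: $seconds
  labels:
    # now() yields a different value on every match, so every match
    # creates a brand-new time series: unbounded cardinality growth
    last_run: now()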

Second approach

My second (and current) approach has been to add a "sidecar" metric to every match that reports this value:

logs:
- path: /logs/my.log
  patterns:
  - match: .*total time execution:\s*(?P<seconds>\d+)\s*seconds
    metric: duration_seconds
    type: gauge
    help: Execution time in seconds
    action: set
    continue: true
    value: $seconds
    labels:
      instance: myInstance
      job: myJobName
  - match: .*total time execution:\s*(?P<seconds>\d+)\s*seconds
    metric: last_execution
    type: gauge
    help: Last execution (Unix timestamp)
    action: set
    continue: true
    value: now()
    labels:
      instance: myInstance
      job: myJobName

This isn't ideal, because for every match I need to inject a repetitive block that issues this "sidecar" metric.
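
(One way to trim the repetition, assuming the exporter's YAML parser supports standard anchors and merge keys, which I haven't verified, would be something like:)

logs:
- path: /logs/my.log
  patterns:
  # shared keys live in an anchored mapping, merged into each pattern
  - <<: &base
      match: .*total time execution:\s*(?P<seconds>\d+)\s*seconds
      type: gauge
      action: set
      continue: true
      labels:
        instance: myInstance
        job: myJobName
    metric: duration_seconds
    help: Execution time in seconds
    value: $seconds
  - <<: *base
    metric: last_execution
    help: Last execution (Unix timestamp)
    value: now()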

Possible third approach

I was thinking that a possible approach would be for multilog_exporter to mimic Pushgateway's behaviour: a boolean flag that, when set, would produce an additional metric reporting the last time the metric was updated for a given group.

I still have little Prometheus knowledge, so I might be missing something, but I guess that if Pushgateway does this, it's because there is a legitimate use case for it.
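
(Just to sketch the idea, with an entirely made-up flag name:)

patterns:
- match: .*total time execution:\s*(?P<seconds>\d+)\s*seconds
  metric: duration_seconds
  type: gauge
  action: set
  value: $seconds
  # hypothetical flag: when true, the exporter would also expose a
  # companion metric (e.g. duration_seconds_last_update) set to the
  # Unix time of the last update, mirroring Pushgateway's push_time_seconds
  track_last_update: true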

What do you think?

@hoffie (Owner) commented Feb 8, 2020

Your second approach is what I would have suggested with the current version, yes. :)

I was thinking that a possible approach would be for multilog_exporter to mimic Pushgateway's behaviour: a boolean flag that, when set, would produce an additional metric reporting the last time the metric was updated for a given group.

This sounds like a useful enhancement. It would probably be nice to be able to control that per metric? Would you want to work on a PR for that one? I guess a new one would make sense.

@kir4h (Contributor, Author) commented Feb 9, 2020

Closing this in favour of #3

@kir4h kir4h closed this Feb 9, 2020