Suppress duplicate log messages (hash based dedup) #4759

mbihoop · 2022-02-07T13:04:45Z

Last week, our team demonstrated log forwarding from Kubernetes workloads to Fluent Bit to Splunk. Unfortunately, we set one of the containers to a very verbose mode during the demo and created a deluge of log messages.

Fluent Bit handled the messages like a champ. Splunk, however, was running in a memory-restricted VM on the developer laptop and, being a bit of a memory hog, did not cope as well. Its memory use kept climbing, and within a short time the system locked up and we had to end the demo.

In the situation I've outlined above, the log messages were all identical, the only difference being the timestamp.

It got me thinking: "Is it possible for Fluent Bit to filter out duplicate messages?"

I did a quick Google search and found the following Ruby plugins, which were written quite some time ago for Fluentd rather than Fluent Bit.

Fluentd plugin to suppress same messages: fujiwara/fluent-plugin-suppress.
Fluentd plugin for removing duplicate logs: edvakf/fluent-plugin-dedup.

I can't immediately find an answer in the documentation, although I did find other load-regulation filters such as leaky-bucket. Is there an equivalent mechanism or extension point for Fluent Bit which will suppress duplicate log events in the following scenario?

"When presented with 100,000 identical, sequential log messages sent over 5 minutes, Fluent Bit will forward only the first message in the sequence to Splunk, then suppress any following messages determined to be identical by some field comparison for a period of, say, 5 minutes following the first message."
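
To make the scenario concrete, here is a minimal Python sketch of hash-based suppression over a time window. It is not Fluent Bit code or an existing Fluent Bit feature. The `DuplicateSuppressor` class, the `log` field name, and the 300-second default are assumptions chosen to match the scenario above.

```python
import hashlib
import time
from typing import Callable, Dict, Mapping


class DuplicateSuppressor:
    """Forward the first record for a key, then drop repeats for `window` seconds."""

    def __init__(self, window: float = 300.0, field: str = "log",
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window = window
        self.field = field    # hypothetical field to compare; the timestamp is ignored
        self.clock = clock    # injectable so the window can be exercised in tests
        self._first_seen: Dict[bytes, float] = {}

    def _key(self, record: Mapping[str, object]) -> bytes:
        # Hash only the compared field, so records that differ only in
        # their timestamp map to the same key.
        value = str(record.get(self.field, ""))
        return hashlib.sha256(value.encode("utf-8")).digest()

    def should_forward(self, record: Mapping[str, object]) -> bool:
        now = self.clock()
        key = self._key(record)
        first = self._first_seen.get(key)
        if first is not None and now - first < self.window:
            return False              # identical message inside the window: suppress
        self._first_seen[key] = now   # first occurrence, or the window has expired
        return True


# Usage with a fake clock: a repeat 10 seconds later is dropped.
fake_now = [0.0]
suppressor = DuplicateSuppressor(window=300.0, clock=lambda: fake_now[0])
first = suppressor.should_forward({"time": "13:00:00", "log": "connection refused"})
fake_now[0] = 10.0
repeat = suppressor.should_forward({"time": "13:00:10", "log": "connection refused"})
print(first, repeat)  # True False
```

Note that this sketch never removes old keys, so its memory use grows with the number of distinct messages it has seen.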

agup006 · 2022-02-15T03:18:39Z

Maintainer

Thanks for submitting this @mbihoop, I like the idea of suppression, which can be super useful in cutting down costs. Perhaps we can make it based off message size too, which might be easier to calculate at high throughput.

One note though: a really common scenario I see in K8s is outputting to stdout, re-ingesting those logs, and then re-outputting to stdout. The comparison wouldn't have much effect in that scenario.

4 replies

mbihoop Feb 28, 2022
Author

One note though: a really common scenario I see in K8s is outputting to stdout, re-ingesting those logs, and then re-outputting to stdout. The comparison wouldn't have much effect in that scenario.

Wow, that sounds terrible. Where does that happen? With container logging?

patrick-stephens Mar 20, 2022
Maintainer

@mbihoop just a couple of examples I thought of, there are probably more...

If you set up a stdout filter or output for Fluent Bit, its output ends up in the log for that pod. If you also set up Fluent Bit to consume all pod logs (as you generally want to), it will consume its own log, which is when you re-ingest that previous output in an Inception-style loop.

It may be that you do want to ingest Fluent Bit output, so you can't just ignore everything from Fluent Bit, e.g. if you're running it as a sidecar to integrate some legacy log into your usual cluster logging architecture. Arguably the Forward output could be used in this case, but maybe the application using the sidecar has to support any Kubernetes deployment (e.g. one not using Fluent Bit as a daemonset), so outputting to stdout is the correct way to do this.

Now, if we had a good way to distinguish the two cases and remove the duplicated re-ingested data, that would be ace! The typical way would be to annotate the Fluent Bit daemonset pods and exclude them from ingestion, although it is useful and sometimes required to include your Fluent Bit logs as well for various reasons.

mbihoop Mar 21, 2022
Author

Hmm, that sounds like a user misconfiguration, sending output to input. We don't need to do something as obtuse as that to provoke a flood of duplicate messages. Something as simple as an application with a logging statement inside a tight loop or retry block will spew forth an endless stream of duplicate messages. Anyhow, it seems relatively easy to deal with, given a memory/CPU tradeoff. Perhaps such optimized techniques are patented, I don't know. Otherwise, it would seem like a good way to add value or add an "enterprise" feature.

mbihoop Mar 21, 2022
Author

The authors of Rsyslog give some pretty good reasons why log de-duplication (at the input source) is considered a bad idea.

https://www.rsyslog.com/doc/master/configuration/action/rsconf1_repeatedmsgreduction.html#discussion

Versions before 7.3.2 applied repeat message reduction to the output side. This had some implications:

they did not account for the actual message origin, so two processes emitting an equally-looking message triggered the repeated message reduction code

repeat message processing could be set on a per-action basis, which has switched to per-input basis for 7.3.2 and above

While turning this feature on can save some space in logs, most log analysis tools need to see the repeated messages, they can’t handle the “last message repeated” format.

This is a feature that worked decades ago when logs were small and reviewed by a human, it fails badly on high volume logs processed by tools.

You may have similar, valid concerns about this approach.
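
One way those two concerns might be softened, purely as an illustration: include the message origin in the comparison key, and attach a repeat count to the next forwarded record instead of emitting a separate "last message repeated" line. The sketch below is Python, not Rsyslog or Fluent Bit behaviour. The `source` and `log` field names, the `suppressed_count` field, and the `OriginAwareSuppressor` class are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, object]


class OriginAwareSuppressor:
    """Suppress repeats per (origin, message) and report how many were dropped."""

    def __init__(self, window: float = 300.0) -> None:
        self.window = window
        # (origin, message) -> [window start time, repeats suppressed so far]
        self._state: Dict[Tuple[str, str], List[float]] = {}

    def process(self, record: Record, now: float) -> Optional[Record]:
        # The origin is part of the key, so two processes emitting the same
        # text are tracked independently.
        key = (str(record.get("source", "")), str(record.get("log", "")))
        entry = self._state.get(key)
        if entry is not None and now - entry[0] < self.window:
            entry[1] += 1
            return None                      # suppressed
        out = dict(record)                   # copy so the caller's record is untouched
        if entry is not None and entry[1] > 0:
            out["suppressed_count"] = int(entry[1])
        self._state[key] = [now, 0]
        return out


s = OriginAwareSuppressor(window=300.0)
print(s.process({"source": "pod-a", "log": "retrying"}, now=0.0))
print(s.process({"source": "pod-b", "log": "retrying"}, now=1.0))  # different origin, forwarded
print(s.process({"source": "pod-a", "log": "retrying"}, now=2.0))  # None
```

A limitation of this design is that the count is only reported when the same message shows up again after the window ends. If it never does, the count is lost, so downstream tools would still not see an exact total.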

alternaivan · 2024-09-12T10:26:40Z

Hi,

Is there anything new on this topic? Is it possible to use a Lua script to suppress duplicate log messages on the fluent-bit side? And if so, did anyone try it?

Thanks!

3 replies

patrick-stephens Sep 12, 2024
Maintainer

Yes, you could potentially do it with a simple hash table or similar lookup approach to keep track of duplicates. There are a few ways to do it, I think, depending on what exactly you need.
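
As a rough idea of what such a lookup table could look like when memory has to stay bounded, here is a small Python sketch. It only illustrates the data structure and would need porting to Lua to run inside a Fluent Bit Lua filter. The `BoundedSeenTable` name, the 8-byte digest, and the `max_entries` limit are assumptions, and it assumes `now` never goes backwards.

```python
import hashlib
from collections import OrderedDict


class BoundedSeenTable:
    """Hash table of recently seen messages with expiry and a size cap."""

    def __init__(self, window: float = 300.0, max_entries: int = 10000) -> None:
        self.window = window
        self.max_entries = max_entries
        # digest -> first-seen time, kept in insertion (oldest-first) order
        self._seen: "OrderedDict[bytes, float]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._seen)

    def _expire(self, now: float) -> None:
        # Entries are never reordered, so the oldest one is always at the front.
        while self._seen:
            _, first = next(iter(self._seen.items()))
            if now - first < self.window:
                break
            self._seen.popitem(last=False)

    def is_duplicate(self, message: str, now: float) -> bool:
        self._expire(now)
        # A short digest trades a small collision risk for less memory per entry.
        key = hashlib.blake2b(message.encode("utf-8"), digest_size=8).digest()
        if key in self._seen:
            return True
        self._seen[key] = now
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)   # table full: evict the oldest entry
        return False


table = BoundedSeenTable(window=300.0, max_entries=2)
print(table.is_duplicate("disk full", now=0.0))  # False
print(table.is_duplicate("disk full", now=1.0))  # True
```

The size cap means a flood of distinct messages can push an entry out early, in which case a later repeat is forwarded again. That errs on the side of sending too much rather than dropping something new.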

alternaivan Sep 16, 2024

Hi @patrick-stephens,

Thanks for your reply! I have an additional question. I see that in this PR, this feature is implemented on the http output.

Is this implemented? If so, will this configuration (log_suppress_interval) also work on other outputs, such as es? I'm not sure if this was implemented or not, and we are missing the documentation about it.

Thanks!

patrick-stephens Sep 17, 2024
Maintainer

I do not know. The easiest way is to try it: an invalid config should fail.
