We have installed Fluent Bit in our EKS cluster to ship logs to OpenSearch and regularly see timeout errors when connecting to OpenSearch. Interestingly, most logs are transferred correctly; only a small proportion fails to transfer. It is unclear to us how to resolve this, and even with debug mode enabled we do not get any meaningful error messages.
I realize there are many issue tickets describing a similar problem, but unfortunately the recommendations given there did not help. We have tried different values for net.dns.mode and net.dns.resolver, switched off TLS verification (tls.verify), and changed the retry limit (Retry_Limit). Nothing helped.
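For reference, the variants we tried in the [OUTPUT] section looked roughly like this (a sketch only; the exact values varied between attempts and are not part of our final configuration below):

    # DNS resolution mode: tried both UDP and TCP
    net.dns.mode     TCP
    # DNS resolver implementation: tried both LEGACY and ASYNC
    net.dns.resolver LEGACY
    # also tried disabling TLS certificate verification
    tls.verify       Off
    # and raising/lowering the retry limit (value here is just an example)
    Retry_Limit      10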
Configuration
Fluent Bit runs in an EKS cluster (Kubernetes 1.28).
It is installed with the aws-for-fluent-bit Helm chart 0.1.33 (application version 2.32.2.20240425).
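For completeness, the install is essentially the stock chart from the eks-charts repository, roughly like this (release name, namespace, and the values file name are our own choices):

    # add the AWS eks-charts repository and install/upgrade the chart
    helm repo add eks https://aws.github.io/eks-charts
    helm upgrade --install aws-for-fluent-bit eks/aws-for-fluent-bit \
      --namespace kube-system \
      --version 0.1.33 \
      -f fluent-bit-values.yaml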
Here is our current fluent-bit.conf:
[SERVICE]
    HTTP_Server            On
    HTTP_Listen            0.0.0.0
    HTTP_PORT              2020
    Health_Check           On
    HC_Errors_Count        5
    HC_Retry_Failure_Count 5
    HC_Period              5
    Log_Level              warn
    Parsers_File           /fluent-bit/parsers/parsers.conf

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    multiline.parser  docker, cri
    Mem_Buf_Limit     20MB
    Skip_Long_Lines   On
    Refresh_Interval  10

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc.cluster.local:443
    Merge_Log           On
    Merge_Log_Key       data
    Keep_Log            On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
    Buffer_Size         10MB

[FILTER]
    Name    lua
    Match   kube.*
    script  /fluent-bit/lua/filters.lua
    call    format_logs

[OUTPUT]
    Name                  opensearch
    Match                 *
    AWS_Region            eu-central-1
    AWS_Auth              On
    Host                  vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com
    Port                  443
    tls                   on
    Buffer_Size           20MB
    Index                 aws-fluent-bit
    Type                  _doc
    Logstash_Format       On
    Logstash_Prefix       logstash
    Logstash_DateFormat   %Y.%m.%d
    Time_Key              @timestamp
    Time_Key_Format       %Y-%m-%dT%H:%M:%S
    Time_Key_Nanos        Off
    Include_Tag_Key       Off
    Tag_Key               _flb-key
    Generate_ID           Off
    Write_Operation       create
    Replace_Dots          On
    Trace_Output          Off
    Trace_Error           On
    Current_Time_Index    On
    Logstash_Prefix_Key   os_index
    Suppress_Type_Name    On
    net.dns.mode          TCP
Here is the log:
Logs(fluent-bit/fluent-bit-5zt55:aws-for-fluent-bit):

Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2024/06/06 09:35:13] [ info] [fluent bit] version=1.9.10, commit=9be1f19e5a, pid=1
[2024/06/06 09:35:13] [ info] [storage] version=1.4.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2024/06/06 09:35:13] [ info] [cmetrics] version=0.3.7
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] multiline core started
[2024/06/06 09:35:13] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc.cluster.local port=443
[2024/06/06 09:35:13] [ info] [filter:kubernetes:kubernetes.0] token updated
[2024/06/06 09:35:13] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2024/06/06 09:35:13] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2024/06/06 09:35:13] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2024/06/06 09:35:13] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2024/06/06 09:35:13] [ info] [sp] stream processor started
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=9437515 watch_fd=1 name=/var/log/containers/aws-node-r8gmr_kube-system_aws-eks-nodeagent-7738e90d0c853740f085
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=8392764 watch_fd=2 name=/var/log/containers/aws-node-r8gmr_kube-system_aws-node-4185d1a2546b33074dbe1f3a2db65
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=7396358 watch_fd=3 name=/var/log/containers/aws-node-r8gmr_kube-system_aws-vpc-cni-init-66351ba3e7cad47c93f6e
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18874491 watch_fd=4 name=/var/log/containers/ebs-csi-node-ccw24_kube-system_ebs-plugin-91ff1c3f87b909e0c468f6
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=22020220 watch_fd=5 name=/var/log/containers/ebs-csi-node-ccw24_kube-system_liveness-probe-b9e85a168bc24c359c
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=19923074 watch_fd=6 name=/var/log/containers/ebs-csi-node-ccw24_kube-system_node-driver-registrar-787c96c9e42
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=5476078 watch_fd=7 name=/var/log/containers/prometheus-prometheus-node-exporter-h8mxh_prometheus_node-exporte
[2024/06/06 09:35:14] [ info] [input:tail:tail.0] inotify_fs_add(): inode=7019429 watch_fd=8 name=/var/log/containers/kube-proxy-82zbg_kube-system_kube-proxy-ac8295ddd275612f5634a2127
...
[2024/06/06 09:36:14] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:36:14] [ warn] [net] getaddrinfo(host='vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com', err=12): Timeout while contacting DNS serve
[2024/06/06 09:36:29] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:36:29] [ warn] [net] getaddrinfo(host='vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com', err=12): Timeout while contacting DNS serve
[2024/06/06 09:36:29] [ warn] [engine] failed to flush chunk '1-1717666510.580386114.flb', retry in 9 seconds: task_id=0, input=tail.0 > output=opensearch.0 (out_id=0)
[2024/06/06 09:36:48] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:36:48] [ warn] [net] getaddrinfo(host='vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com', err=12): Timeout while contacting DNS serve
[2024/06/06 09:36:48] [ warn] [engine] failed to flush chunk '1-1717666511.76029697.flb', retry in 8 seconds: task_id=1, input=tail.0 > output=opensearch.0 (out_id=0)
[2024/06/06 09:37:07] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:37:07] [ warn] [net] getaddrinfo(host='vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com', err=12): Timeout while contacting DNS serve
[2024/06/06 09:37:07] [ warn] [engine] failed to flush chunk '1-1717666526.88917251.flb', retry in 6 seconds: task_id=2, input=tail.0 > output=opensearch.0 (out_id=0)
[2024/06/06 09:37:35] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:37:35] [ warn] [engine] failed to flush chunk '1-1717666526.562789262.flb', retry in 7 seconds: task_id=3, input=tail.0 > output=opensearch.0 (out_id=0)
[2024/06/06 09:37:53] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:37:53] [engine] caught signal (SIGSEGV)
[2024/06/06 09:37:53] [ warn] [net] getaddrinfo(host='vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com', err=12): Timeout while contacting DNS serve
Stream closed EOF for fluent-bit/fluent-bit-8rj9t (aws-for-fluent-bit)
After a while, when the errors pile up, the pod restarts, as the end of the log above (the caught SIGSEGV and the closed log stream) shows.
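Since every failure in the log is a getaddrinfo err=12 (DNS timeout) rather than an error returned by OpenSearch, two things that can be checked from inside an affected pod are name resolution for the VPC endpoint and the built-in health endpoint (Health_Check is On, HTTP server on port 2020). Roughly like this (pod name taken from the log; the image may not ship nslookup or curl, in which case an ephemeral debug container on the same node works):

    # run a DNS lookup of the OpenSearch VPC endpoint from inside the Fluent Bit pod
    kubectl exec -n fluent-bit fluent-bit-5zt55 -- \
      nslookup vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com

    # query Fluent Bit's health endpoint, which reports an error once the
    # HC_Errors_Count / HC_Retry_Failure_Count thresholds are exceeded within HC_Period
    kubectl exec -n fluent-bit fluent-bit-5zt55 -- \
      curl -s http://127.0.0.1:2020/api/v1/health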