Prometheus is continuously failing to send samples to Thanos with the error below and is accumulating delayed records, so after some time it stops sending data to Thanos altogether.
I'm here to assist you with any bugs, questions, or contributions. Let's work together to resolve the issue you're facing.
I found some relevant information that might help address the issue you're experiencing with Thanos Receive and the HTTP 500 Internal Server Error.
Common Causes of HTTP 500 Errors: These errors in Thanos Receive can be caused by several factors, such as out-of-order labels, duplicate labels, empty labels, out-of-order samples, duplicate samples for the same timestamp, and out-of-bounds samples. These issues often arise from misconfigurations or insufficient resources [1].
Handling Out-of-Order Samples: Thanos Receive can handle out-of-order samples using the --tsdb.out-of-order.time-window flag. This feature allows for the ingestion of samples that do not meet the strict timestamp order requirements, but it increases memory usage and results in blocks with overlapping time periods. Ensure that the --compact.enable-vertical-compaction flag is enabled on the compactor to manage these overlapping blocks [2].
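For reference, here is a minimal sketch of how those two flags might be wired into a Bitnami-style Thanos Helm values file. The `extraFlags` keys and the `30m` window are illustrative assumptions, not values taken from your attached Helm values:

```yaml
# Hypothetical excerpt of Thanos chart values; key names can differ between chart versions.
receive:
  extraFlags:
    # Accept samples that arrive up to 30m out of order per series (window size is an assumption).
    - --tsdb.out-of-order.time-window=30m
compactor:
  extraFlags:
    # Required so the compactor can merge the overlapping blocks that out-of-order ingestion produces.
    - --compact.enable-vertical-compaction
```

Keep the trade-off above in mind: enabling the out-of-order window increases memory usage on the receivers.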
Retry Mechanism: While specific details on how Thanos handles retries for failed remote write requests are not available, it is generally recommended to ensure that your configuration is optimized to handle high ingestion rates and that adequate resources are allocated to the Thanos Receive component.
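Note that the retries in your logs come from Prometheus itself (`queue_manager.go` "Failed to send batch, retrying"), and its remote-write queue can be tuned per target. A minimal sketch of such tuning in kube-prometheus-stack values follows; the numbers are illustrative assumptions, not sizing recommendations for your workload:

```yaml
# Hypothetical excerpt of kube-prometheus-stack values; tune the numbers to your ingestion rate.
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive
        queueConfig:
          maxShards: 50            # upper bound on parallel senders; resharding stays at or below this
          maxSamplesPerSend: 2000  # larger batches mean fewer, larger requests
          capacity: 10000          # samples buffered per shard before the WAL reader blocks
          batchSendDeadline: 5s    # flush a partial batch after this long
          minBackoff: 100ms        # retry backoff bounds for failed sends
          maxBackoff: 10s
```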
Similar Issues: There are similar discussions and issues related to Thanos Receive failing to handle requests and internal server errors. For example, this issue discusses similar problems, but no specific solution was found in the comments [3].
Reviewing your configuration and ensuring that it aligns with the recommended settings for handling out-of-order samples and managing resources effectively might help resolve the issue. Additionally, checking for any misconfigurations or resource constraints could be beneficial.
Hi Team,
We are running the following versions of Thanos and Prometheus:
thanos (namespace: thanos): chart thanos-15.8.1, app version 0.36.1
kube-prometheus-stack (namespace: monitoring): chart kube-prometheus-stack-67.8.0, app version v0.79.2
Prometheus is continuously failing to send samples to Thanos with the error below and is accumulating delayed records, so after some time it stops sending data to Thanos altogether.
The Helm values file and logs are attached.
Please review them and help us resolve this error.
prometheus-stack.log
thanos-receive.log
Helm values.txt

Prometheus logs:
time=2025-01-31T04:56:26.039Z level=INFO source=queue_manager.go:1096 msg="Remote storage resharding" component=remote remote_name=thanos url=http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive from=50 to=34
time=2025-01-31T04:56:36.040Z level=INFO source=queue_manager.go:1099 msg="Currently resharding, skipping." component=remote remote_name=thanos url=http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive
time=2025-01-31T04:57:25.087Z level=WARN source=queue_manager.go:2050 msg="Failed to send batch, retrying" component=remote remote_name=thanos url=http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive err="server returned HTTP status 500 Internal Server Error: context deadline exceeded\n"
time=2025-01-31T04:57:26.039Z level=INFO source=queue_manager.go:1096 msg="Remote storage resharding" component=remote remote_name=thanos url=http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive from=50 to=31
time=2025-01-31T04:57:36.040Z level=INFO source=queue_manager.go:1099 msg="Currently resharding, skipping." component=remote remote_name=thanos url=http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive
time=2025-01-31T04:58:36.039Z level=INFO source=queue_manager.go:1096 msg="Remote storage resharding" component=remote remote_name=thanos url=http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive from=42 to=50
Thanos Receive logs:
segment=183 maxSegment=186
ts=2025-01-31T03:58:28.352732999Z caller=head.go:793 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WAL segment loaded" segment=184 maxSegment=186
ts=2025-01-31T03:58:29.262991843Z caller=head.go:793 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WAL segment loaded" segment=185 maxSegment=186
ts=2025-01-31T03:58:29.263442836Z caller=head.go:793 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WAL segment loaded" segment=186 maxSegment=186
ts=2025-01-31T03:58:29.263489476Z caller=head.go:830 level=info component=receive component=multi-tsdb tenant=default-tenant msg="WAL replay completed" checkpoint_replay_duration=9.208857598s wal_replay_duration=31.909334701s wbl_replay_duration=220ns chunk_snapshot_load_duration=0s mmap_chunk_replay_duration=1.493758856s total_replay_duration=42.612061925s
ts=2025-01-31T03:58:30.993521082Z caller=multitsdb.go:727 level=info component=receive component=multi-tsdb tenant=default-tenant msg="TSDB is now ready"
ts=2025-01-31T03:58:30.997204066Z caller=intrumentation.go:56 level=info component=receive msg="changing probe status" status=ready
ts=2025-01-31T03:58:30.997255597Z caller=receive.go:658 level=info component=receive msg="storage started, and server is ready to receive requests"
ts=2025-01-31T04:00:46.350786005Z caller=multitsdb.go:406 level=info component=receive component=multi-tsdb msg="Running pruning job"
ts=2025-01-31T04:01:33.008749634Z caller=writer.go:249 level=info component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting out-of-order samples" numDropped=63
ts=2025-01-31T04:03:31.03585799Z caller=writer.go:249 level=info component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting out-of-order samples" numDropped=9
ts=2025-01-31T04:39:17.876342842Z caller=writer.go:249 level=info component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting out-of-order samples" numDropped=8
ts=2025-01-31T04:41:45.933983223Z caller=writer.go:249 level=info component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting out-of-order samples" numDropped=5
ts=2025-01-31T04:48:22.73564749Z caller=writer.go:249 level=info component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting out-of-order samples" numDropped=9
ts=2025-01-31T04:48:58.737558554Z caller=writer.go:249 level=info component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting out-of-order samples" numDropped=5
ts=2025-01-31T04:57:22.766959638Z caller=writer.go:249 level=info component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting out-of-order samples" numDropped=36
ts=2025-01-31T05:00:47.877181501Z caller=compact.go:580 level=info component=receive component=multi-tsdb tenant=default-tenant msg="write block" mint=1738288800000 maxt=1738296000000 ulid=01JJXBNVQ0GB9KCDJT2PJ51XG4 duration=47.269017327s ooo=false
ts=2025-01-31T05:00:50.009276769Z caller=head.go:1355 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Head GC completed" caller=truncateMemory duration=2.125990891s
ts=2025-01-31T05:00:50.074370398Z caller=checkpoint.go:101 level=info component=receive component=multi-tsdb tenant=default-tenant msg="Creat