
'IndexError: list index out of range' in Prometheus Scraping of Scaphandre Metrics #355

Open · CherifMZ opened this issue Feb 1, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

CherifMZ commented Feb 1, 2024

I have successfully installed Scaphandre on my Kubernetes cluster using the provided documentation here. The installation command includes enabling ServiceMonitor and setting the interval to 2 seconds:

helm install scaphandre helm/scaphandre --set serviceMonitor.enabled=true --set serviceMonitor.interval=2s

Additionally, I have set up Prometheus and adjusted its configuration to a global scraping interval of 2 seconds with a timeout of 1 second.
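
The relevant part of the Prometheus configuration, roughly (a sketch of the global section only; the scrape jobs themselves come from the ServiceMonitor):

global:
  scrape_interval: 2s
  scrape_timeout: 1s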

My objective is to monitor the energy usage metric of each node, for which I created a Python script executed in a Jupyter Notebook. The script queries Prometheus for the 'scaph_host_energy_microjoules' metric in a loop:

import requests
import time

prometheus = 'http://localhost:9090'
action = 0  # node index: 0, 1, or 2 (one per node; set per query in the notebook)

while True:
    energy_query = 'scaph_host_energy_microjoules'
    response_energy = requests.get(prometheus + '/api/v1/query', params={'query': energy_query})
    result_energy = response_energy.json().get('data', {}).get('result', [])

    # The IndexError occurs here after some runtime, once fewer results than expected are returned
    energy_usage = float(result_energy[action]['value'][1])

    time.sleep(5)

After running the script for approximately 40 minutes, an 'IndexError: list index out of range' occurs. This seems to indicate that Prometheus is not able to scrape metrics from all three nodes consistently: the Scaphandre pod responsible for gathering a node's metrics periodically goes down and restarts, causing intermittent interruptions (sometimes, for about a second, only two of the three pods are serving metrics).

Additional details:

  • The cluster is created using Kind with a basic configuration (2 worker nodes and 1 master).
  • action can be 0, 1, or 2, corresponding to the three nodes in the cluster.

I suspect that the problem might be related to the scrape interval. Your insights and suggestions on resolving this issue would be greatly appreciated. Thank you in advance for your assistance.
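
A minimal sketch of how the loop could tolerate a missing result instead of raising, assuming the same query and action index as above:

import requests
import time

prometheus = 'http://localhost:9090'
energy_query = 'scaph_host_energy_microjoules'
action = 0  # node index: 0, 1, or 2

while True:
    response = requests.get(prometheus + '/api/v1/query', params={'query': energy_query})
    result = response.json().get('data', {}).get('result', [])

    if len(result) > action:
        energy_usage = float(result[action]['value'][1])
    else:
        # Fewer results than nodes: one of the scaphandre pods was not scraped this round
        print('only', len(result), 'results returned, skipping this sample')

    time.sleep(5)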

uname -a
Linux my_pc 6.1.0-1029-oem #29-Ubuntu SMP PREEMPT_DYNAMIC Tue Jan  9 21:07:34 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
cat /proc/cpuinfo
model		: 186
model name	: 13th Gen Intel(R) Core(TM) i5-1335U
CherifMZ added the bug label Feb 1, 2024
mmadoo (Contributor) commented Feb 1, 2024

I am using a scrape interval of 5 minutes and a scrape timeout of 2 minutes.
What is the reason for setting such a short interval?
The risk with such a short interval is that scaphandre uses all the CPU if you have a lot of pods.

CherifMZ (Author) commented Feb 2, 2024

I am using a scrape interval of 5 minutes and a scrape timeout of 2 minutes. What is the reason for setting such a short interval? The risk with such a short interval is that scaphandre uses all the CPU if you have a lot of pods.

I'm using machine learning, so I need frequently updated data.

bpetit added this to General Jun 19, 2024
bpetit moved this to Triage in General Jun 19, 2024
bpetit (Contributor) commented Oct 17, 2024

Hi, do you have any logs from scaphandre on the nodes, close to the restart?
