
'IndexError: list index out of range' in Prometheus Scraping of Scaphandre Metrics #355

Open · CherifMZ opened this issue Feb 1, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

CherifMZ commented Feb 1, 2024

I have successfully installed Scaphandre on my Kubernetes cluster using the provided documentation here. The installation command includes enabling ServiceMonitor and setting the interval to 2 seconds:

helm install scaphandre helm/scaphandre --set serviceMonitor.enabled=true --set serviceMonitor.interval=2s

Additionally, I have set up Prometheus and adjusted its configuration to a global scraping interval of 2 seconds with a timeout of 1 second.
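
The relevant part of the Prometheus configuration, roughly (a sketch of the global section only; the scrape jobs themselves come from the ServiceMonitor):

global:
  scrape_interval: 2s
  scrape_timeout: 1s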

My objective is to monitor the energy usage metric of each node, for which I created a Python script executed in a Jupyter Notebook. The script queries Prometheus for the 'scaph_host_energy_microjoules' metric in a loop:

import requests
import time

prometheus = 'http://localhost:9090'
action = 0  # node index: 0, 1, or 2 (one per node; set per query in the notebook)

while True:
    energy_query = 'scaph_host_energy_microjoules'
    response_energy = requests.get(prometheus + '/api/v1/query', params={'query': energy_query})
    result_energy = response_energy.json().get('data', {}).get('result', [])

    # The IndexError occurs here after some runtime, once fewer results than expected are returned
    energy_usage = float(result_energy[action]['value'][1])

    time.sleep(5)

After running the script for approximately 40 minutes, an 'IndexError: list index out of range' occurs. This seems to indicate that Prometheus is not able to scrape metrics from all three nodes consistently: the Scaphandre pod responsible for gathering a node's metrics periodically goes down and restarts, causing intermittent interruptions (sometimes, for about a second, only two of the three pods are serving metrics).

Additional details:

  • The cluster is created using Kind with a basic configuration (2 worker nodes and 1 master).
  • action can be 0, 1, or 2, corresponding to the three nodes in the cluster.

I suspect that the problem might be related to the scrape interval. Your insights and suggestions on resolving this issue would be greatly appreciated. Thank you in advance for your assistance.
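
A minimal sketch of how the loop could tolerate a missing result instead of raising, assuming the same query and action index as above:

import requests
import time

prometheus = 'http://localhost:9090'
energy_query = 'scaph_host_energy_microjoules'
action = 0  # node index: 0, 1, or 2

while True:
    response = requests.get(prometheus + '/api/v1/query', params={'query': energy_query})
    result = response.json().get('data', {}).get('result', [])

    if len(result) > action:
        energy_usage = float(result[action]['value'][1])
    else:
        # Fewer results than nodes: one of the scaphandre pods was not scraped this round
        print('only', len(result), 'results returned, skipping this sample')

    time.sleep(5)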

uname -a
Linux my_pc 6.1.0-1029-oem #29-Ubuntu SMP PREEMPT_DYNAMIC Tue Jan  9 21:07:34 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
cat /proc/cpuinfo
model		: 186
model name	: 13th Gen Intel(R) Core(TM) i5-1335U
CherifMZ added the bug label Feb 1, 2024
mmadoo (Contributor) commented Feb 1, 2024

I am using a scrape interval of 5 minutes and a scrape timeout of 2 minutes.
What is the reason for setting such a short interval?
The risk with such a short interval is that scaphandre uses all the CPU if you have a lot of pods.

CherifMZ (Author) commented Feb 2, 2024

I am using a scrape interval of 5 minutes and a scrape timeout of 2 minutes. What is the reason for setting such a short interval? The risk with such a short interval is that scaphandre uses all the CPU if you have a lot of pods.

I'm using machine learning, so I need frequently updated data.

bpetit added this to General Jun 19, 2024
bpetit moved this to Triage in General Jun 19, 2024
bpetit (Contributor) commented Oct 17, 2024

Hi, do you have any logs from scaphandre on the nodes, close to the restart?
