
Pointers on using this with Hetzner VMs and bare metal #43

mrchrisadams opened this issue Apr 26, 2024 · 5 comments
mrchrisadams commented Apr 26, 2024

Hi Arne / Didi / other green coding folks,

Thanks for publishing this project. I have a couple of questions about using it to turn utilisation figures into power figures in watts (to turn into carbon figures later), and I hope you can help.

I'm looking at coming up with some numbers for a few VMs on Hetzner, to work out some better figures for the green web platform. We mainly run a mix of the following instance types:

  • cx51
  • cx21
  • cpx21

I believe these run on a mixture of Intel® Xeon® Gold or AMD EPYC™ 7002 processors.

If it helps provide some useful context, you can see a sketch below of the setup. The db server is a cx51, we have a scalable pool of cpx21s as app server workers, and a monitoring box is a cx21.

[image: sketch of the server setup]

Here are the things I'd appreciate pointers on.

Working out the threads and cores, and frequency

I think I can use lscpu to query the underlying CPU for each machine:

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      40 bits physical, 48 bits virtual
CPU(s):                             3
On-line CPU(s) list:                0-2
Thread(s) per core:                 1
Core(s) per socket:                 3
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          AuthenticAMD
CPU family:                         23
Model:                              49
Model name:                         AMD EPYC Processor
Stepping:                           0
CPU MHz:                            2445.404
BogoMIPS:                           4890.80
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          96 KiB
L1i cache:                          96 KiB
L2 cache:                           1.5 MiB
L3 cache:                           16 MiB

From here, I'm assuming I'd use these values as inputs for the spec power model, right?

Thread(s) per core:                 1
Core(s) per socket:                 3
CPU MHz:                            2445.404
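
For reference, something like this filters lscpu down to just those fields (exact labels can vary between lscpu versions):

lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)|CPU MHz'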

Working out the vhost ratio

vHost Ratio [float (0,1])

  • The vHost ratio on the system you are on. If you are on a bare metal machine this is 1
  • If you are a guest and have e.g. 24 of the 96 Threads then the ratio would be 0.25
  • Currently the model cannot account for non-balanced CPU and memory ratios.

This I'm less confident about - I was under the impression the virtualisation / overcommitment ratio was usually obscured by cloud providers, and while companies like Hetzner offer both shared vCPU and dedicated vCPU variants of virtual machines, I'm not sure how to represent this when running this tool to turn utilisation figures into energy.

Do you have an example you have worked on before that I can use as a reference, or any pointers on how I'd estimate or otherwise account for the level of overcommitment of virtual CPUs to physical CPUs on a box?

Thanks and hope the eco-compute conference went well 👍

ArneTR commented Apr 26, 2024

This is a longer one :) Let me get back to you on this once we have recovered from EcoCompute.

Just a question for the meantime: Did you try the parameter auto discovery? What values are you getting? Which ones could not be discovered?

Try: python3 xgb.py --auto

ArneTR commented May 2, 2024

Hey @mrchrisadams

I've now had the chance to revisit this post. Have you been able to try the --auto mode in the meantime?

In general, when using the CLI there are these two variants:

1. Auto mode

If you do not know much about the system, use the --auto mode. It will auto-discover the parameters that your current user can see. If you are on a bare metal machine you will get the correct parameters and a good estimation.

If you are on a VM you will get wrong parameters, as you can only see the resources assigned to you. The model will set the --vhost-ratio to 1.

The result is that your VM, which is a slice of a big bare metal machine, looks like a much smaller bare metal machine. The estimated value for energy will in turn be off. But the energy curve will be that of an actual machine and will also show its non-linear behaviour.

2. Manual mode

If you can find the data sheet of the machine you are working on, manual mode is preferred. Here you set the parameters to those of the actual bare metal machine and then use --vhost-ratio to map the resource allocation of the VM. (This can be done in multiple ways. See our discussion on how we implemented it.)

Assuming your machine is an EPYC 7002, the information from Hetzner indicates that they use the 8-core variant (see the AMD data page and the Hetzner page).

What we do is look at the maximum available VM size and then assume that that is the bare metal variant. In the Hetzner case the biggest machine for the CPX plans is a shared 16 vCPU. This means the machine has 16 threads, which are shared (hyperthreads). The full machine thus has 8 cores and 16 threads. The --vhost-ratio in your CPX21 case, which has 3 threads assigned, is thus 3/16 = 0.1875.
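
As a quick sketch of that calculation (the 16 host threads are the assumption derived from the biggest shared CPX plan):

GUEST_THREADS=3    # vCPU threads the CPX21 guest sees in lscpu
HOST_THREADS=16    # assumed thread count of the underlying bare metal machine
python3 -c "print($GUEST_THREADS / $HOST_THREADS)"   # -> 0.1875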

If you were to craft it manually, the CLI call for the AMD CPU shown above would look like this:

python3 xgb.py --cpu-chips 1 --cpu-threads 16 --cpu-cores 8 --cpu-make amd --cpu-freq 3700 --vhost-ratio 0.1875

Since you have no info about the exact model the TDP is unknown.

You would not set the CPU frequency from lscpu in this case, as the value that lscpu shows is, to my knowledge, the current frequency. The model needs the base frequency, though, which is 3.7 GHz according to the data sheet for the 8-core model.
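
Putting it together with a binary that streams CPU utilization, a sketch could look like this (the static-binary path is an assumption - any tool that prints utilization values line by line works):

./demo-reporter/static-binary | python3 xgb.py --cpu-chips 1 --cpu-threads 16 --cpu-cores 8 --cpu-make amd --cpu-freq 3700 --vhost-ratio 0.1875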

Additional

See also an example of how we set the values for the GitHub machines here: https://github.com/green-coding-solutions/eco-ci-energy-estimation/blob/1fcfe95976c4f8a78e3248be5072ad4797610d42/scripts/vars.sh#L50

Summary

Let me know if that explanation is helpful, and please add any findings that came up for you. It would be great to add this to the documentation to help other users.

Thanks for asking such a detailed question!

mrchrisadams commented

Hi @ArneTR, sorry about the slow reply.

I'll post representative output for the types of instances we are using:

  • cx51 (used in the bigger DB server)
  • cpx31 (one larger worker instance we keep in the worker pool, that's grandfathered in on the older, more generous pricing)
  • cpx21 (our pool of smaller workers)
  • cx21 (smallest worker, used for our monitoring / logging box)

app1 - representative of a CPX31 instance

deploy@app1:~/spec-power-model$ python3.11 xgb.py --auto
No arguments where supplied, or auto mode was forced. Running auto detect on the sytem.
Exception: [Errno 2] No such file or directory: '/sys/class/powercap/intel-rapl/intel-rapl:0/name'
Could not read RAPL powercapping info from /sys/class/powercap/intel-rapl
Exception: [Errno 2] No such file or directory: '/sys/class/powercap/intel-rapl/intel-rapl:0/name'
Could not find (additional) chips info under file path. Most likely reached final chip. continuing ...
Found Threads: 4
Found Sockets: 1 (will take precedence if not 0)
Found cores: 4
Could not find Frequency. Using default None
Found Make: amd
Found Memory: 8 GB
The following data was auto detected: {'freq': None, 'threads': 4, 'cores': 4, 'tdp': None, 'mem': 8, 'make': 'amd', 'chips': 1}
vHost ratio is set to 1.0
Training data will be restricted to the following amount of chips: 1
Model will be trained on the following columns and restrictions:
   CPUThreads  CPUCores  HW_MemAmountGB  utilization  CPUMake_amd
0           4         4               8          0.0         True

app2 - representative of CPX21 instances

(.venv) deploy@app2:~/spec-power-model$ python ./xgb.py --auto
No arguments where supplied, or auto mode was forced. Running auto detect on the sytem.
Exception: [Errno 2] No such file or directory: '/sys/class/powercap/intel-rapl/intel-rapl:0/name'
Could not read RAPL powercapping info from /sys/class/powercap/intel-rapl
Exception: [Errno 2] No such file or directory: '/sys/class/powercap/intel-rapl/intel-rapl:0/name'
Could not find (additional) chips info under file path. Most likely reached final chip. continuing ...
Found Threads: 3
Found Sockets: 1 (will take precedence if not 0)
Found cores: 3
Could not find Frequency. Using default None
Found Make: amd
Found Memory: 4 GB
The following data was auto detected: {'freq': None, 'threads': 3, 'cores': 3, 'tdp': None, 'mem': 4, 'make': 'amd', 'chips': 1}
vHost ratio is set to 1.0
Training data will be restricted to the following amount of chips: 1
Model will be trained on the following columns and restrictions:
   CPUThreads  CPUCores  HW_MemAmountGB  utilization  CPUMake_amd
0           3         3               4          0.0         True

db2 - representative of a CX51 instance

(.venv) deploy@db2:~/spec-power-model$ python3.10 ./xgb.py --auto
No arguments where supplied, or auto mode was forced. Running auto detect on the sytem.
Exception: [Errno 2] No such file or directory: '/sys/class/powercap/intel-rapl/intel-rapl:0/name'
Could not read RAPL powercapping info from /sys/class/powercap/intel-rapl
Exception: [Errno 2] No such file or directory: '/sys/class/powercap/intel-rapl/intel-rapl:0/name'
Could not find (additional) chips info under file path. Most likely reached final chip. continuing ...
Found Threads: 8
Found Sockets: 1 (will take precedence if not 0)
Found cores: 8
Could not find Frequency. Using default None
Found Memory: 31 GB
The following data was auto detected: {'freq': None, 'threads': 8, 'cores': 8, 'tdp': None, 'mem': 31, 'make': None, 'chips': 1}
vHost ratio is set to 1.0
Training data will be restricted to the following amount of chips: 1
Model will be trained on the following columns and restrictions:
   CPUThreads  CPUCores  HW_MemAmountGB  utilization
0           8         8              31          0.0

mon1 - representative of CX21 instances

(.venv) deploy@monitoring:~/spec-power-model$ python3.11 xgb.py --auto
No arguments where supplied, or auto mode was forced. Running auto detect on the sytem.
Exception: [Errno 2] No such file or directory: '/sys/class/powercap/intel-rapl/intel-rapl:0/name'
Could not read RAPL powercapping info from /sys/class/powercap/intel-rapl
Exception: [Errno 2] No such file or directory: '/sys/class/powercap/intel-rapl/intel-rapl:0/name'
Could not find (additional) chips info under file path. Most likely reached final chip. continuing ...
Found Threads: 1
Found Sockets: 1 (will take precedence if not 0)
Found cores: 2
Could not find Frequency. Using default None
Found Memory: 4 GB
The following data was auto detected: {'freq': None, 'threads': 1, 'cores': 2, 'tdp': None, 'mem': 4, 'make': None, 'chips': 1}
vHost ratio is set to 1.0
Training data will be restricted to the following amount of chips: 1
Model will be trained on the following columns and restrictions:
   CPUThreads  CPUCores  HW_MemAmountGB  utilization
0           1         2               4          0.0
Infering all predictions to dictionary

I realise this expects a stream of utilisation figures to turn into energy numbers, and I see this in the readme:

You must call the python file ols.py or xgb.py. This file is designed to accept streaming inputs.

A typical call with a streaming binary that reports CPU Utilization could look like so:

$ ./static-binary | python3 ols.py --tdp 240
191.939294374113
169.99632303510703
191.939294374113
191.939294374113
191.939294374113
191.939294374113
194.37740205685841

In this case, would the compiled static binary be this one here? The docs suggest it is, but I wasn't sure.

https://github.com/green-coding-solutions/green-metrics-tool/tree/main/metric_providers/cpu/utilization/procfs/system
https://docs.green-coding.io/docs/measuring/metric-providers/cpu-utilization-procfs-system/

ArneTR commented May 18, 2024

Thanks for the --auto runs. Quite interesting what you can see on the machines!

As said: you can use --auto to get something going. Better estimations happen when you configure the parameters manually with educated information. I hope my pointers on how I did this, using the CPX21 as an example, help!

  1. Regarding the static-binary: This is in the subfolder of the repo itself. See https://github.com/green-coding-solutions/spec-power-model/tree/main/demo-reporter

Just compile it with gcc -o static-binary and you have your static binary.

This is mentioned under: https://github.com/green-coding-solutions/spec-power-model?tab=readme-ov-file#demo-reporter
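
A minimal sketch of the full flow, assuming the C source in that folder is called cpu-utilization.c (check the demo-reporter folder for the actual file name):

cd spec-power-model/demo-reporter
gcc cpu-utilization.c -o static-binary    # source file name is an assumption
cd ..
./demo-reporter/static-binary | python3 xgb.py --auto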

I see from the discussion that our documentation is not as good as we would like it to be :(

I would really love to take some pointers from you on what you found confusing and how we can improve! If you can, a PR for the documentation would also be very happily accepted :)

ArneTR commented May 18, 2024

Since you are setting up the model on your machines, here's a pointer to an adjacent project that might also be lacking some awareness and documentation.

What we do with these mini clients on our Hetzner machines is stream all their data to our product CarbonDB.

It is basically a data-drain that accepts inputs from all our tools:

  • SpecPowerModel
  • Eco-CI
  • GreenMetricsTool
  • PowerHog
  • etc.

You can then see all of your infrastructure (servers, pipelines, MacBooks etc.) in one summarizing view.

Example for us:

This view shows you the carbon cost of the actual webserver that delivers https://metrics.green-coding.io/ plus some of the pipelines we run on GitHub.

We have not integrated all of our machines, but it is possible!

In order to leverage this feature for your pipelines specifically you need to set the values:

  • company-uuid (must be set manually pre-hand)
  • project-uuid (must be set manually pre-hand)
  • machine-uuid (Will be auto generated, but can be forced to a manual value)

See documentation here: https://github.com/green-coding-solutions/eco-ci-energy-estimation

Effectively you would have to add these three values to your workflow.

Please give me a separate ping if this is helpful to you and you maybe want to integrate this on top.

At the moment this is also a free feature we provide with no data cap. If you want to integrate this elsewhere, it is of course also open source and can be self-hosted; metrics.green-coding.io is just the service we provide for free at the moment.
