
feat(vm_metrics): Enabled vm metric use for local model training #464

Merged

Conversation

KaiyiLiu1234
Collaborator

Currently, the model server training setup uses baremetal metrics only. This PR adds a "use_vm_metrics" field to enable model training with job='vm' metrics and kepler_vm energy sources. This will be a useful setup for further testing.

Enabled Training via Pipeline to use vm metrics for features
and kepler_vm energy readings for label/output.

Signed-off-by: Kaiyi <[email protected]>
Removed METAL_JOB_NAME as the unit tests do not create "metal" jobs.

Signed-off-by: Kaiyi <[email protected]>
src/kepler_model/cmd/main.py
if use_vm_metrics:
    aggr_query_data = aggr_query_data.loc[aggr_query_data["job"] == VM_JOB_NAME]
else:
    aggr_query_data = aggr_query_data.loc[aggr_query_data["job"] != VM_JOB_NAME]
Contributor

Don't think we need this else condition. To avoid unexpected job filtering in the general case, I think we can filter only when use_vm_metrics is enabled.
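
A minimal sketch of that suggestion, reusing the names from the diff above (the filter is applied only when use_vm_metrics is set, leaving the general case untouched):

if use_vm_metrics:
    # keep only rows scraped from the VM job; otherwise leave the data untouched
    aggr_query_data = aggr_query_data.loc[aggr_query_data["job"] == VM_JOB_NAME]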

Collaborator Author

My concern is that if the filter is not added, job=vm metrics will also be pulled by the model server alongside baremetal metrics, which is not a desirable result, right?

Contributor

Just thinking that the job label is a common label that could be relabeled, for example if we want to label the pod with the application's job. However, I understand your point that we might also want the BM metrics in the case where the VM is enabled. After some thought, I think we can keep it as you coded it.

@@ -72,6 +77,10 @@ def energy_component_to_query(component):
return f"{node_query_prefix}_{component}_{node_query_suffix}"


def vm_energy_component_to_query(component):
Contributor

I don't think we should use kepler_vm for training. We should use kepler_node but with the job="metal" label (or job != "vm"). kepler_vm does not come from the power meter but from the ratio model, which takes fairness of distribution into account. In the case where we have only a single VM this might not make a difference, but I think we are better off using the number from the power meter.

Collaborator Author

@rootfs thoughts? If I were to use kepler_node, I would need to find the process id of the virtual machine. If I recall correctly, kepler_node with the VM process id and kepler_vm are the same value or very similar.

Contributor

I don't think the VM process value will be the same if the idle power is not zero, since idle power is distributed evenly across all running processes, not only the VM process. But I think kepler_vm and kepler_node should be the same if you have a single VM.
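
To make the idle-power point concrete, a rough illustrative calculation (the numbers are made up and the attribution is simplified compared to what Kepler actually does):

node_idle_watts = 20.0      # assumed idle power measured for the node
running_processes = 4       # the VM process plus three other host processes

# If idle power is shared evenly across every running process, the VM
# process only receives a fraction of it, so its per-process kepler_node
# reading can differ from kepler_vm even when a single VM is the main consumer.
vm_process_idle_share = node_idle_watts / running_processes  # 5.0 W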

if query not in query_results:
    print(query, "not in", query_results)
    return None
aggr_query_data = query_results[query].copy()
if not use_vm_metrics:
Contributor

Could you please add a comment explaining this line?

Collaborator Author

So if it is baremetal, I do not want it to use job=vm energy metric queries. We should only use baremetal metrics. Basically, I am making sure that the behavior of the extractor prior to my addition of vm metrics stays the same as before (which is that it extracts all the metrics with kepler_node_*). Same in get_workload_feature_dataset.

Contributor

Can we have a single splitter that splits a single query_result into vm_query_results and bm_query_results (a rough sketch follows below)? It could be applied to the feature extraction as well.
Here I think it is fine to filter only the BM query, since the VM query contains the "vm" word, which is only available on BM. However, as I mentioned below, I still doubt whether we should use the VM value rather than the node value.
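
For reference, a minimal sketch of what such a splitter could look like, assuming the query results are pandas DataFrames with a job label column; the function name and structure are illustrative, not the actual implementation:

import pandas as pd

VM_JOB_NAME = "vm"  # constant introduced by this PR

def split_query_results(query_results: dict[str, pd.DataFrame]):
    # Split every query result into VM rows and bare-metal rows by the "job" label.
    vm_query_results = {}
    bm_query_results = {}
    for query, data in query_results.items():
        vm_query_results[query] = data.loc[data["job"] == VM_JOB_NAME]
        bm_query_results[query] = data.loc[data["job"] != VM_JOB_NAME]
    return vm_query_results, bm_query_results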

Collaborator Author

Good point. I will change that now.

Collaborator Author

Can we merge this as is? The issue is that a lot of the methods in DefaultExtractor, which IMO should be protected, are being used directly in other functions (like get_workload_feature_data and get_power), so I can't really create a single splitter in the DefaultExtractor.extract method. I can also close this, because I am currently creating a new extractor (DefaultVMExtractor) which will include this PR's changes. @sunya-ch

Contributor

@KaiyiLiu1234 Thank you for your contribution to this PR. In my opinion, creating a new DefaultVMExtractor is preferable.

Switched --vm-train to a flag which produces bool
values.

Signed-off-by: Kaiyi <[email protected]>
Added use_vm_metrics as a field in Offline Trainer.

Signed-off-by: Kaiyi <[email protected]>
sthaha
sthaha previously requested changes Sep 26, 2024
Contributor

@sthaha sthaha left a comment


IMHO, we should test if this works to yield better results before we merge it.
What say @sunya-ch @rootfs ?

@KaiyiLiu1234
Collaborator Author

IMHO, we should test if this works to yield better results before we merge it. What say @sunya-ch @rootfs ?

This should be sufficient proof: https://github.com/sustainable-computing-io/kepler-metal-ci/blob/main/docs/analytics/vm-cpu-time-predict-vm-power.ipynb - I am going to replicate this in a separate extractor using process instead of container metrics.

@sunya-ch
Contributor

sunya-ch commented Sep 27, 2024

@sthaha I see the notebook and it looks good to me. Even though the total CPU time of the processes inside the VM is not as smooth as it should be, it can still give a clue for modeling.

This is my understanding of the ways to build the model and predict the VM's power.

  • BM's total CPU time -> BM's total energy (without running any VM) - current
  • VM's total CPU time -> BM's energy of VM (running VM) - this PR

@KaiyiLiu1234 It would be great if you could add the result of the current version for comparison. I cannot find the total CPU time on BM in the current artifact. Could you point me to it?

FYI, I did some more plots from the notebook.

  • CPU times are slightly different between BM's VM metric and the VM's total (expect a straight line if they were equal)
    [Screenshot 2024-09-27 at 10 45 00]
  • The difference in CPU time observed inside and outside the VM does not seem to significantly affect the power model (some difference is observed when the CPU time is low). Train inside = use the total CPU time observed inside the VM; train outside = use the CPU time observed outside the VM. Both use the same power labels, which are BM's VM joules.
    [Screenshot 2024-09-27 at 11 15 31]
  • Note that neither of the two predictions above includes the current approach, which uses BM's total energy and BM's total CPU time, as I cannot find BM's total CPU time in the artifact.

Here is the code to plot. I moved everything into functions that return the scaler and popt of the model.

import numpy as np  # the remaining names come from earlier cells of the notebook

in_vm_cpu_time_url = "https://raw.githubusercontent.com/sustainable-computing-io/kepler-metal-ci/refs/heads/main/docs/validation/2024-09-25/validator-v0.7.11-244-g99b1f94a/artifacts/vm-kepler_process_bpf_cpu_time_ms_total--absolute.json"
bm_vm_cpu_time_url = "https://raw.githubusercontent.com/sustainable-computing-io/kepler-metal-ci/refs/heads/main/docs/validation/2024-09-25/validator-v0.7.11-244-g99b1f94a/artifacts/metal-kepler_vm_bpf_cpu_time_ms_total--absolute.json"

x = np.array(in_vm_cpu_time_values).reshape(len(in_vm_cpu_time_values), 1)
outside_test_features = outside_scaler.transform(x)
inside_test_features = inside_scaler.transform(x)
_, bm_vm_joules_values = url_to_value(joules_url)
y_pred_outside = logi_func(outside_test_features.flatten(), *outside_popt)
y_pred_inside = logi_func(inside_test_features.flatten(), *inside_popt)
y_actual = bm_vm_joules_values
simple_plot('Train outside, Test inside', y_pred_outside, y_actual)
simple_plot('Train inside, Test inside', y_pred_inside, y_actual)
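
For readers trying to run the snippet: url_to_value, logi_func, simple_plot, the scalers, and the popt values are defined in earlier cells of the notebook. A hypothetical sketch of the two plotting-related helpers (the notebook's actual definitions may differ):

import numpy as np
import matplotlib.pyplot as plt

def logi_func(x, L, k, x0, b):
    # Hypothetical four-parameter logistic curve used for the power fit.
    return L / (1.0 + np.exp(-k * (x - x0))) + b

def simple_plot(title, y_pred, y_actual):
    # Hypothetical helper: overlay predicted and actual joules by sample index.
    plt.figure()
    plt.plot(y_actual, label="actual")
    plt.plot(y_pred, label="predicted")
    plt.title(title)
    plt.legend()
    plt.show()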

My thoughts

Sorry for the long comment. Let me wrap up here.
To conclude,

  • I agree that train inside and test inside is better than train outside and test inside for VM.
  • I am not sure whether we need to separate out the case of a BM that simply has no power meter and wants a power model.
  • We need to add the VM configuration to the power model training description.

@KaiyiLiu1234
Collaborator Author

@sunya-ch +1 on your comment. I will open issues for the points you made regarding this setup and begin work on them. Regarding your question on total CPU time on BM, it's not included when the validator saves its results; it should be in the prometheus snapshot. We can also include it in the validator for future runs.

Contributor

@sthaha sthaha left a comment

I agree that train inside and test inside is better than train outside and test inside for VM

I worry that we are again falling into the trap of relying on the training data. E.g. have we tried training using a VM and then testing against a different-sized VM to see whether the accuracy differs? This should be much easier to test now.

I am removing my "Request Changes" in the interest of time.

@sthaha sthaha dismissed their stale review September 29, 2024 21:51

Let's test this on metal CI with different-sized VMs (number of CPUs and memory).

@@ -354,6 +354,7 @@ def get_pipeline(
    energy_sources,
    valid_feature_groups,
    replace_node_type=default_node_type,
    use_vm_metrics=False,
Contributor

@sthaha sthaha Sep 29, 2024


I think we should allow passing selectors instead of hardcoding use_vm_metrics.

E.g. kepler-model train --x-selectors='job="vm"' --x-selectors='foo="bar"' --y-selectors='job="metal"'
and keep a default of none, i.e. a single kepler instance.
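
A rough sketch of how such repeatable selector flags might be parsed (the flag names follow the example above; they are a proposal, not existing kepler-model options):

import argparse
import re

def parse_selector(expr: str) -> tuple[str, str]:
    # Turn a selector like 'job="vm"' into a (label, value) pair.
    match = re.fullmatch(r'(\w+)="([^"]*)"', expr)
    if match is None:
        raise argparse.ArgumentTypeError(f"invalid selector: {expr}")
    return match.group(1), match.group(2)

parser = argparse.ArgumentParser(prog="kepler-model")
parser.add_argument("--x-selectors", action="append", type=parse_selector, default=[],
                    help="label selectors applied to feature (x) metrics")
parser.add_argument("--y-selectors", action="append", type=parse_selector, default=[],
                    help="label selectors applied to target (y) metrics")

# With no selectors given, both lists stay empty, i.e. the single-Kepler-instance default.
args = parser.parse_args(['--x-selectors=job="vm"', '--y-selectors=job="metal"'])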

Contributor

@KaiyiLiu1234 could you address this in a separate PR?

@@ -406,6 +406,7 @@ def train_from_data(args):
- --energy-source : specify target energy sources (use comma(,) as delimiter)
- --thirdparty-metrics : specify list of third party metric to export (required only for ThirdParty feature group)
- --id : specify machine ID
- --vm-train: specify whether to use vm feature and energy metrics for training - true: use vm feature metrics. Default is false
Contributor

As indicated earlier, this should be replaced with a way to select the feature-group metrics and a way to select the "target" metrics.

@sunya-ch
Contributor

sunya-ch commented Sep 30, 2024

I agree that train inside and test inside is better than train outside and test inside for VM

I worry that we are again falling into the trap of relying on the training data. E.g. have we tried training using a VM and then testing against a different-sized VM to see whether the accuracy differs? This should be much easier to test now.

Continuing from the above comment, I think we should separate the extractor as @KaiyiLiu1234 proposed, and we need the VM configuration information (as additional metadata) to be required when providing a model built with that extractor.

It might also be worthwhile, as you mentioned, to experiment with how the VM size (one of the configuration parameters) affects the power model.

Kaiyi Liu added 2 commits September 30, 2024 20:29
When using vm training, enabled process metrics for training
rather than container metrics.

Signed-off-by: Kaiyi Liu <[email protected]>
@KaiyiLiu1234
Collaborator Author

@rootfs I will merge this for now. When the model server undergoes its improvements, we can change this anyway.

@KaiyiLiu1234 KaiyiLiu1234 merged commit 9b4b832 into sustainable-computing-io:main Oct 22, 2024
16 of 18 checks passed