feat(vm_metrics): Enabled vm metric use for local model training #464
Conversation
Enabled Training via Pipeline to use vm metrics for features and kepler_vm energy readings for label/output. Signed-off-by: Kaiyi <[email protected]>
Signed-off-by: Kaiyi <[email protected]>
Removed METAL_JOB_NAME as the unit tests do not create "metal" jobs. Signed-off-by: Kaiyi <[email protected]>
if use_vm_metrics:
    aggr_query_data = aggr_query_data.loc[aggr_query_data["job"] == VM_JOB_NAME]
else:
    aggr_query_data = aggr_query_data.loc[aggr_query_data["job"] != VM_JOB_NAME]
I don't think we need this else condition. To avoid unexpected job filtering in the general case, I think we can filter only when use_vm_metrics is enabled.
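A minimal sketch of that suggestion, assuming the pandas DataFrame and the VM_JOB_NAME constant from the diff above (the function wrapper is added here purely for illustration):

import pandas as pd

# Sketch of the suggestion: filter to job="vm" rows only when
# use_vm_metrics is enabled, and leave the general case untouched.
def filter_vm_rows(aggr_query_data: pd.DataFrame, use_vm_metrics: bool, vm_job_name: str = "vm") -> pd.DataFrame:
    if use_vm_metrics:
        return aggr_query_data.loc[aggr_query_data["job"] == vm_job_name]
    return aggr_query_data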
My concern is that if the filter is not added, job=vm metrics will also be pulled by the model server alongside baremetal metrics, which is not a desirable result, right?
Just thinking that job is a common label that could be relabeled, for example if we want to label the pod with the application's job. However, I understand your point that we might also want the BM metrics in the case where the VM is enabled. After some thought, I think we can keep it as you coded it.
@@ -72,6 +77,10 @@ def energy_component_to_query(component):
    return f"{node_query_prefix}_{component}_{node_query_suffix}"


def vm_energy_component_to_query(component):
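The diff truncates the new function's body. By analogy with energy_component_to_query above, it presumably builds the matching kepler_vm query name; this is a sketch only, and vm_query_prefix/vm_query_suffix are hypothetical stand-ins for the module's real constants:

# Hypothetical constants for illustration; the real values live elsewhere
# in the module (compare node_query_prefix/node_query_suffix above).
vm_query_prefix = "kepler_vm"
vm_query_suffix = "joules_total"

def vm_energy_component_to_query(component):
    # e.g. component="package" -> "kepler_vm_package_joules_total"
    return f"{vm_query_prefix}_{component}_{vm_query_suffix}"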
I don't think we should use kepler_vm for training. We should use kepler_node but with the job="metal" label (or job != "vm"). kepler_vm is not from the power meter but from the ratio model, which takes fairness distribution into account. In the case where we have only a single VM this might not make a difference, but I think we are better off using the number from the power meter.
@rootfs thoughts? If I were to use kepler_node, I would need to find the process id of the virtual machine. If I recall correctly, kepler_node with the vm process id and kepler_vm are the same value or very similar.
I don't think the vm process value will be the same if the idle power is not zero, since idle power is distributed evenly across all running processes, not only the vm process. But I think kepler_vm and kepler_node should be the same if you have a single VM.
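For concreteness, the two energy sources being weighed in this thread could be queried roughly like this (a sketch: the label filters follow the comments above, but the exact metric suffixes are assumptions, not taken from the codebase):

# Illustrative Prometheus query strings only.
power_meter_query = 'kepler_node_package_joules_total{job!="vm"}'  # measured by the power meter on BM
ratio_model_query = 'kepler_vm_package_joules_total'               # per-VM estimate from the ratio model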
if query not in query_results:
    print(query, "not in", query_results)
    return None
aggr_query_data = query_results[query].copy()
if not use_vm_metrics:
Could you please add a comment explaining this line?
So if it is baremetal, I do not want it to use job=vm energy metric queries; we should only use baremetal metrics. Basically, I am making sure that the behavior of the extractor prior to my addition of vm metrics stays the same as before (it extracts all the metrics with kepler_node_*). Same in get_workload_feature_dataset.
Can we have a single splitter that splits a single query_result into vm_query_results and bm_query_results? It could be applied to the feature extraction as well, as in the sketch below.
Here I think it is fine to filter only the bm query, since the vm query has the vm word, which is only available on bm. However, as I mentioned below, I still have doubts about using the vm value rather than the node value.
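A sketch of that single-splitter idea, assuming the same pandas DataFrame with a "job" label used elsewhere in this PR:

import pandas as pd

# Split one query_result into its VM and BM portions in a single pass,
# so both the power extraction and the feature extraction can reuse it.
def split_vm_bm(query_result: pd.DataFrame, vm_job_name: str = "vm"):
    vm_mask = query_result["job"] == vm_job_name
    return query_result.loc[vm_mask], query_result.loc[~vm_mask]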
Good point. I will change that now
Can we merge this as is? The issue is that a lot of the methods in DefaultExtractor which, in my opinion, should be protected are being used directly in other functions (like get_workload_feature_data and get_power), so I can't really create a single splitter in the DefaultExtractor.extract method. I can also close this, because I am currently creating a new Extractor (DefaultVMExtractor) which will include this PR and the new Extractor. @sunya-ch
@KaiyiLiu1234 Thank you for your contribution on this PR. In my opinion, creating a new DefaultVMExtractor is preferable.
Switched --vm-train to a flag which produces bool values. Signed-off-by: Kaiyi <[email protected]>
Added use_vm_metrics as a field in Offline Trainer. Signed-off-by: Kaiyi <[email protected]>
Signed-off-by: Kaiyi <[email protected]>
This should be sufficient proof: https://github.com/sustainable-computing-io/kepler-metal-ci/blob/main/docs/analytics/vm-cpu-time-predict-vm-power.ipynb - I am going to replicate this in a separate extractor using process instead of container metrics.
@sthaha I see the notebook and it looks good to me. Even though the total CPU time of processes inside the VM is not as smooth as it should be, it can still give a clue for modeling. This is my understanding of the way to build the model and predict the VM's power.
@KaiyiLiu1234 It would be great if you could add the result of the current version for comparison. I cannot find the total CPU time on BM in the current artifact. Could you point me to it? FYI,
Here is the code to plot. I moved everything into functions and return the scaler and the popt of the model.

import numpy as np

# Artifact URLs: CPU time measured inside the VM vs. on the metal host.
in_vm_cpu_time_url = "https://raw.githubusercontent.com/sustainable-computing-io/kepler-metal-ci/refs/heads/main/docs/validation/2024-09-25/validator-v0.7.11-244-g99b1f94a/artifacts/vm-kepler_process_bpf_cpu_time_ms_total--absolute.json"
bm_vm_cpu_time_url = "https://raw.githubusercontent.com/sustainable-computing-io/kepler-metal-ci/refs/heads/main/docs/validation/2024-09-25/validator-v0.7.11-244-g99b1f94a/artifacts/metal-kepler_vm_bpf_cpu_time_ms_total--absolute.json"

# url_to_value, logi_func, simple_plot, the fitted scalers, and the popt
# parameters come from the notebook linked above.
x = np.array(in_vm_cpu_time_values).reshape(len(in_vm_cpu_time_values), 1)
outside_test_features = outside_scaler.transform(x)
inside_test_features = inside_scaler.transform(x)
_, bm_vm_joules_values = url_to_value(joules_url)
y_pred_outside = logi_func(outside_test_features.flatten(), *outside_popt)
y_pred_inside = logi_func(inside_test_features.flatten(), *inside_popt)
y_actual = bm_vm_joules_values
simple_plot('Train outside, Test inside', y_pred_outside, y_actual)
simple_plot('Train inside, Test inside', y_pred_inside, y_actual)

My thoughts: sorry for the long comment. Let me wrap up here.
@sunya-ch +1 on your comment. I will open issues for the points you made regarding the setup and begin work on that. Regarding your question on total CPU time on BM: it's not included in the validator when saving; it should be in the Prometheus snapshot. We can also include it in the validator for future runs.
I agree that train inside and test inside is better than train outside and test inside for a VM.
I worry that we are again falling into the trap of relying on the training data. E.g., have we tried training using one VM, then testing against a different-sized VM to see if the accuracy differs? This should be much easier to test now.
I am removing my "Request Changes" in the interest of time.
Let's test this on the metal CI with different-sized VMs (number of CPUs and memory).
@@ -354,6 +354,7 @@ def get_pipeline(
    energy_sources,
    valid_feature_groups,
    replace_node_type=default_node_type,
    use_vm_metrics=False,
I think we should allow passing selectors instead of hardcoding use_vm_metrics.
E.g. kepler-model train --x-selectors='job="vm"' --x-selectors='foo="bar"'
--y-selectors='job="metal"'
and keep a default of none, i.e. a single kepler instance.
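A hypothetical sketch of how such repeatable selector flags could be parsed (argparse shown for illustration; the flag names follow the example above and are not the actual kepler-model CLI):

import argparse

parser = argparse.ArgumentParser(prog="kepler-model")
# action="append" lets each flag be passed multiple times; the empty-list
# default means no label filtering, i.e. a single kepler instance.
parser.add_argument("--x-selectors", action="append", default=[],
                    help='label selector for feature (x) metrics, e.g. job="vm"')
parser.add_argument("--y-selectors", action="append", default=[],
                    help='label selector for target (y) metrics, e.g. job="metal"')

args = parser.parse_args(['--x-selectors', 'job="vm"', '--y-selectors', 'job="metal"'])
print(args.x_selectors, args.y_selectors)  # ['job="vm"'] ['job="metal"']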
@KaiyiLiu1234 could you address this in a separate PR?
@@ -406,6 +406,7 @@ def train_from_data(args):
- --energy-source : specify target energy sources (use comma(,) as delimiter)
- --thirdparty-metrics : specify list of third party metric to export (required only for ThirdParty feature group)
- --id : specify machine ID
- --vm-train : specify whether to use vm feature and energy metrics for training - true: use vm feature metrics. Default is false
As indicated earlier, this should be replaced with a way to select the feature-group metrics and a way to select the "target" metrics.
Continuing from the above comment, I think we should separate the extractor as @KaiyiLiu1234 proposed, and we need VM configuration information (as additional metadata) as a requirement for providing the model with that extractor. It might also be worthwhile, as you mentioned, to experiment with how the VM size (one element of the configuration) affects the power model.
When using vm training, enabled process metrics for training rather than container metrics. Signed-off-by: Kaiyi Liu <[email protected]>
…with vm Signed-off-by: Kaiyi Liu <[email protected]>
Force-pushed c8a6085 to fb45b43
@rootfs I will merge this for now. When the model server undergoes its improvements, we can change this anyway.
Merged 9b4b832 into sustainable-computing-io:main
Currently, the model server training setup uses baremetal metrics only. This PR adds a "use_vm_metrics" field to enable model training with job='vm' metrics and kepler_vm energy sources. This will be a useful setup for further testing.
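Putting the pieces from the diff hunks together, the intended behavior is roughly the following (a sketch: the two query helpers are the ones shown in the diffs above, and the wiring here is illustrative, not the PR's exact code):

# use_vm_metrics switches both the energy query source and the job filter.
def select_energy_query(component, use_vm_metrics):
    if use_vm_metrics:
        return vm_energy_component_to_query(component)  # kepler_vm readings
    return energy_component_to_query(component)         # kepler_node readings

def filter_by_job(aggr_query_data, use_vm_metrics, vm_job_name="vm"):
    # job="vm" rows feed VM training; everything else is baremetal.
    if use_vm_metrics:
        return aggr_query_data.loc[aggr_query_data["job"] == vm_job_name]
    return aggr_query_data.loc[aggr_query_data["job"] != vm_job_name]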