
feat(vm_metrics): Enabled vm metric use for local model training #464

Merged

Conversation

KaiyiLiu1234
Collaborator

Currently, the model server training setup uses baremetal metrics only. This PR adds a "use_vm_metrics" field to enable model training with job='vm' metrics and kepler_vm energy sources. This will be a useful setup for further testing.

Enabled Training via Pipeline to use vm metrics for features
and kepler_vm energy readings for label/output.

Signed-off-by: Kaiyi <[email protected]>
Removed METAL_JOB_NAME as the unit tests do not create "metal" jobs.

Signed-off-by: Kaiyi <[email protected]>
src/kepler_model/cmd/main.py
if use_vm_metrics:
    aggr_query_data = aggr_query_data.loc[aggr_query_data["job"] == VM_JOB_NAME]
else:
    aggr_query_data = aggr_query_data.loc[aggr_query_data["job"] != VM_JOB_NAME]
Contributor

Don't think we need this else condition. To avoid unexpected job filtering in the general case, I think we can filter only when use_vm_metrics is enabled.
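
A minimal sketch of that suggestion, reusing the names from the diff above (the filter is applied only when use_vm_metrics is set, leaving the general case untouched):

if use_vm_metrics:
    # keep only rows scraped from the VM job; otherwise leave the data untouched
    aggr_query_data = aggr_query_data.loc[aggr_query_data["job"] == VM_JOB_NAME]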

Collaborator Author

My concern is that if the filter is not added, job=vm metrics will also be pulled by the model server alongside baremetal metrics, which is not a desirable result, right?

Contributor

Just thinking that the job label is a common label that could be relabeled, for example if we want to label the pod with the application's job. However, I understand your point that we might also want the BM metrics in the case where the VM is enabled. After some thought, I think we can keep it as you coded it.

@@ -72,6 +77,10 @@ def energy_component_to_query(component):
return f"{node_query_prefix}_{component}_{node_query_suffix}"


def vm_energy_component_to_query(component):
Contributor

I don't think we should use kepler_vm for training. We should use kepler_node but with the job="metal" label (or job != "vm"). kepler_vm does not come from the power meter but from the ratio model, which takes fairness of distribution into account. In the case where we have only a single VM this might not make a difference, but I think we are better off using the number from the power meter.

Collaborator Author

@rootfs thoughts? If I were to use kepler_node, I would need to find the process id of the virtual machine. If I recall correctly, kepler_node with the VM process id and kepler_vm are the same value or very similar.

Contributor

I don't think the VM process value will be the same if the idle power is not zero, since idle power is distributed evenly across all running processes, not only the VM process. But I think kepler_vm and kepler_node should be the same if you have a single VM.
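
To make the idle-power point concrete, a rough illustrative calculation (the numbers are made up and the attribution is simplified compared to what Kepler actually does):

node_idle_watts = 20.0      # assumed idle power measured for the node
running_processes = 4       # the VM process plus three other host processes

# If idle power is shared evenly across every running process, the VM
# process only receives a fraction of it, so its per-process kepler_node
# reading can differ from kepler_vm even when a single VM is the main consumer.
vm_process_idle_share = node_idle_watts / running_processes  # 5.0 W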

if query not in query_results:
    print(query, "not in", query_results)
    return None
aggr_query_data = query_results[query].copy()
if not use_vm_metrics:
Contributor

Could you please add a comment explaining this line?

Collaborator Author

So if it is baremetal, I do not want it to use job=vm energy metric queries. We should only use baremetal metrics. Basically, I am making sure that the behavior of the extractor prior to my addition of vm metrics stays the same as before (which is that it extracts all the metrics with kepler_node_*). Same in get_workload_feature_dataset.

Contributor

Can we have a single splitter that splits a single query_result into vm_query_results and bm_query_results (a rough sketch follows below)? It could be applied to the feature extraction as well.
Here I think it is fine to filter only the BM query, since the VM query contains the "vm" word, which is only available on BM. However, as I mentioned below, I still doubt whether we should use the VM value rather than the node value.
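
For reference, a minimal sketch of what such a splitter could look like, assuming the query results are pandas DataFrames with a job label column; the function name and structure are illustrative, not the actual implementation:

import pandas as pd

VM_JOB_NAME = "vm"  # constant introduced by this PR

def split_query_results(query_results: dict[str, pd.DataFrame]):
    # Split every query result into VM rows and bare-metal rows by the "job" label.
    vm_query_results = {}
    bm_query_results = {}
    for query, data in query_results.items():
        vm_query_results[query] = data.loc[data["job"] == VM_JOB_NAME]
        bm_query_results[query] = data.loc[data["job"] != VM_JOB_NAME]
    return vm_query_results, bm_query_results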

Collaborator Author

Good point. I will change that now.

Collaborator Author

Can we merge this as is? The issue is that a lot of the methods in DefaultExtractor, which IMO should be protected, are being used directly in other functions (like get_workload_feature_data and get_power), so I can't really create a single splitter in the DefaultExtractor.extract method. I can also close this, because I am currently creating a new extractor (DefaultVMExtractor) which will include this PR's changes. @sunya-ch

Contributor

@KaiyiLiu1234 Thank you for your contribution to this PR. In my opinion, creating a new DefaultVMExtractor is preferable.

Switched --vm-train to a flag which produces bool
values.

Signed-off-by: Kaiyi <[email protected]>
Added use_vm_metrics as a field in Offline Trainer.

Signed-off-by: Kaiyi <[email protected]>
sthaha
sthaha previously requested changes Sep 26, 2024
Contributor

@sthaha sthaha left a comment


IMHO, we should test if this works to yield better results before we merge it.
What say @sunya-ch @rootfs ?

@KaiyiLiu1234
Collaborator Author

IMHO, we should test if this works to yield better results before we merge it. What say @sunya-ch @rootfs ?

This should be sufficient proof: https://github.com/sustainable-computing-io/kepler-metal-ci/blob/main/docs/analytics/vm-cpu-time-predict-vm-power.ipynb - I am going to replicate this in a separate extractor using process instead of container metrics.

@sunya-ch
Contributor

sunya-ch commented Sep 27, 2024

@sthaha I see the notebook and it looks good to me. Even though the total CPU time of the processes inside the VM is not as smooth as it should be, it can still give a clue for modeling.

This is my understanding of the ways to build the model and predict the VM's power.

  • BM's total CPU time -> BM's total energy (without running any VM) - current
  • VM's total CPU time -> BM's energy of VM (running VM) - this PR

@KaiyiLiu1234 It would be great if you could add the result of the current version for comparison. I cannot find the total CPU time on BM in the current artifact. Could you point me to it?

FYI, I did some more plots from the notebook.

  • CPU times are slightly different between BM's VM metric and the VM's total (expect a straight line if they were equal)
    [Screenshot 2024-09-27 at 10 45 00]
  • The difference in CPU time observed inside and outside the VM does not seem to significantly affect the power model (some difference is observed when the CPU time is low). Train inside = use the total CPU time observed inside the VM; train outside = use the CPU time observed outside the VM. Both use the same power labels, which are BM's VM joules.
    [Screenshot 2024-09-27 at 11 15 31]
  • Note that neither of the two predictions above includes the current approach, which uses BM's total energy and BM's total CPU time, as I cannot find BM's total CPU time in the artifact.

Here is the code to plot. I moved everything into functions that return the scaler and popt of the model.

import numpy as np  # the remaining names come from earlier cells of the notebook

in_vm_cpu_time_url = "https://raw.githubusercontent.com/sustainable-computing-io/kepler-metal-ci/refs/heads/main/docs/validation/2024-09-25/validator-v0.7.11-244-g99b1f94a/artifacts/vm-kepler_process_bpf_cpu_time_ms_total--absolute.json"
bm_vm_cpu_time_url = "https://raw.githubusercontent.com/sustainable-computing-io/kepler-metal-ci/refs/heads/main/docs/validation/2024-09-25/validator-v0.7.11-244-g99b1f94a/artifacts/metal-kepler_vm_bpf_cpu_time_ms_total--absolute.json"

x = np.array(in_vm_cpu_time_values).reshape(len(in_vm_cpu_time_values), 1)
outside_test_features = outside_scaler.transform(x)
inside_test_features = inside_scaler.transform(x)
_, bm_vm_joules_values = url_to_value(joules_url)
y_pred_outside = logi_func(outside_test_features.flatten(), *outside_popt)
y_pred_inside = logi_func(inside_test_features.flatten(), *inside_popt)
y_actual = bm_vm_joules_values
simple_plot('Train outside, Test inside', y_pred_outside, y_actual)
simple_plot('Train inside, Test inside', y_pred_inside, y_actual)
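
For readers trying to run the snippet: url_to_value, logi_func, simple_plot, the scalers, and the popt values are defined in earlier cells of the notebook. A hypothetical sketch of the two plotting-related helpers (the notebook's actual definitions may differ):

import numpy as np
import matplotlib.pyplot as plt

def logi_func(x, L, k, x0, b):
    # Hypothetical four-parameter logistic curve used for the power fit.
    return L / (1.0 + np.exp(-k * (x - x0))) + b

def simple_plot(title, y_pred, y_actual):
    # Hypothetical helper: overlay predicted and actual joules by sample index.
    plt.figure()
    plt.plot(y_actual, label="actual")
    plt.plot(y_pred, label="predicted")
    plt.title(title)
    plt.legend()
    plt.show()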

My thoughts

Sorry for the long comment. Let me wrap up here.
To conclude,

  • I agree that train inside and test inside is better than train outside and test inside for VM.
  • I am not sure whether we need to separate out the case of a BM that simply has no power meter and wants a power model.
  • We need to add the VM configuration to the power model training description.

@KaiyiLiu1234
Collaborator Author

@sunya-ch +1 on your comment. I will open issues for the points you made regarding this setup and begin work on them. Regarding your question on total CPU time on BM, it's not included when the validator saves its results; it should be in the prometheus snapshot. We can also include it in the validator for future runs.

Contributor

@sthaha sthaha left a comment

I agree that train inside and test inside is better than train outside and test inside for VM

I worry that we are again falling into the trap of relying on the training data. E.g. have we tried training using a VM and then testing against a different-sized VM to see whether the accuracy differs? This should be much easier to test now.

I am removing my "Request Changes" in the interest of time.

@sthaha sthaha dismissed their stale review September 29, 2024 21:51

Let's test this on metal CI with different-sized VMs (number of CPUs and memory).

@@ -354,6 +354,7 @@ def get_pipeline(
    energy_sources,
    valid_feature_groups,
    replace_node_type=default_node_type,
    use_vm_metrics=False,
Contributor

@sthaha sthaha Sep 29, 2024


I think we should allow passing selectors instead of hardcoding use_vm_metrics.

E.g. kepler-model train --x-selectors='job="vm"' --x-selectors='foo="bar"' --y-selectors='job="metal"'
and keep a default of none, i.e. a single kepler instance.
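
A rough sketch of how such repeatable selector flags might be parsed (the flag names follow the example above; they are a proposal, not existing kepler-model options):

import argparse
import re

def parse_selector(expr: str) -> tuple[str, str]:
    # Turn a selector like 'job="vm"' into a (label, value) pair.
    match = re.fullmatch(r'(\w+)="([^"]*)"', expr)
    if match is None:
        raise argparse.ArgumentTypeError(f"invalid selector: {expr}")
    return match.group(1), match.group(2)

parser = argparse.ArgumentParser(prog="kepler-model")
parser.add_argument("--x-selectors", action="append", type=parse_selector, default=[],
                    help="label selectors applied to feature (x) metrics")
parser.add_argument("--y-selectors", action="append", type=parse_selector, default=[],
                    help="label selectors applied to target (y) metrics")

# With no selectors given, both lists stay empty, i.e. the single-Kepler-instance default.
args = parser.parse_args(['--x-selectors=job="vm"', '--y-selectors=job="metal"'])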

Contributor

@KaiyiLiu1234 could you address this in a separate PR?

@@ -406,6 +406,7 @@ def train_from_data(args):
- --energy-source : specify target energy sources (use comma(,) as delimiter)
- --thirdparty-metrics : specify list of third party metric to export (required only for ThirdParty feature group)
- --id : specify machine ID
- --vm-train: specify whether to use vm feature and energy metrics for training - true: use vm feature metrics. Default is false
Contributor

As indicated earlier, this should be replaced with a way to select the feature-group metrics and a way to select the "target" metrics.

@sunya-ch
Contributor

sunya-ch commented Sep 30, 2024

I agree that train inside and test inside is better than train outside and test inside for VM

I worry that we are again falling into the trap of relying on the training data. E.g. have we tried training using a VM and then testing against a different-sized VM to see whether the accuracy differs? This should be much easier to test now.

Continuing from the above comment, I think we should separate the extractor as @KaiyiLiu1234 proposed, and we need the VM configuration information (as additional metadata) to be required when providing a model built with that extractor.

It might also be worthwhile, as you mentioned, to experiment with how the VM size (one of the configuration parameters) affects the power model.

Kaiyi Liu added 2 commits September 30, 2024 20:29
When using vm training, enabled process metrics for training
rather than container metrics.

Signed-off-by: Kaiyi Liu <[email protected]>
@KaiyiLiu1234
Collaborator Author

@rootfs I will merge this for now. When the model server undergoes its improvements, we can change this anyway.

@KaiyiLiu1234 KaiyiLiu1234 merged commit 9b4b832 into sustainable-computing-io:main Oct 22, 2024
16 of 18 checks passed