[DNM] MCO-1437: MCO-1476: MCO-1477: MCO-1284: Adapt MCO to OCL v1 API #4756

djoshy · 2024-12-13T20:51:19Z

[DNM] Testing current state of v1 API

openshift-ci-robot · 2024-12-13T20:51:23Z

@djoshy: This pull request references MCO-1437 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

openshift-ci · 2024-12-13T20:51:23Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2024-12-13T20:51:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [djoshy]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pkg/controller/build/buildrequest/buildrequest.go

pkg/controller/build/buildrequest/machineosbuild.go

djoshy · 2024-12-13T20:54:23Z

pkg/controller/build/buildrequest/machineosbuild_test.go

@@ -81,6 +80,7 @@ func TestMachineOSBuild(t *testing.T) {
 				OSImageURLConfig:  fixtures.OSImageURLConfig(),
 			},
 		},
+		/* QOCL: Are these test cases valid anymore?


There are 6 test cases here that seem to exercise fields that have been removed: BaseOSExtensionsImagePullspec, BaseOSImagePullspec, and ReleaseVersion.

In my opinion, we still need those fields in order to produce a hashed name for a MachineOSBuild. The difference is that we only have a single canonical source for these values with the new API changes.

Understood. I guess my question is this: since these fields no longer exist, the tests can no longer manipulate them to produce a "different" hash for a build. Does that mean what these tests were verifying (manipulating these fields and examining the resulting hash) can't be done anymore?
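To make the hashing question concrete, here is a minimal sketch of how a deterministic build name could be derived from a set of inputs. This is not the MCO's implementation: the `buildInputs` type, the `hashedBuildName` function, the choice of SHA-256, and the 12-character suffix are all assumptions made for illustration. The field names are borrowed from the ones discussed in this thread.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// buildInputs is a hypothetical stand-in for the values that feed the name
// hash. It is not an MCO type.
type buildInputs struct {
	PoolName                      string
	BaseOSImagePullspec           string
	BaseOSExtensionsImagePullspec string
	ReleaseVersion                string
}

// hashedBuildName returns "<pool>-<first 12 hex chars of a SHA-256 digest>".
// The digest covers every input except the pool name, so changing any one of
// them yields a different name (barring a hash collision).
func hashedBuildName(in buildInputs) string {
	h := sha256.New()
	fields := []string{
		in.BaseOSImagePullspec,
		in.BaseOSExtensionsImagePullspec,
		in.ReleaseVersion,
	}
	for _, field := range fields {
		// The length prefix keeps ("ab", "c") distinct from ("a", "bc").
		fmt.Fprintf(h, "%d:%s;", len(field), field)
	}
	return in.PoolName + "-" + hex.EncodeToString(h.Sum(nil))[:12]
}

func main() {
	a := buildInputs{
		PoolName:                      "worker",
		BaseOSImagePullspec:           "registry.example/os:latest",
		BaseOSExtensionsImagePullspec: "registry.example/os-ext:latest",
		ReleaseVersion:                "4.19.0",
	}
	b := a
	b.ReleaseVersion = "4.19.1"

	fmt.Println(hashedBuildName(a))
	fmt.Println(hashedBuildName(b))
}
```

Under this sketch, a test can still check that the name changes when an input changes, as long as it varies the input at its single canonical source rather than through a field on the API object.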

djoshy · 2024-12-13T20:55:26Z

test/e2e-ocl/onclusterlayering_test.go

-	ctx, cancel := context.WithCancel(context.Background())
-	t.Cleanup(cancel)
+	/*
+		QOCL: Need a new failure mode for this as BaseImagePullspec field doesn't exist anymore


With the removal of BaseImagePullSpec, I'm not sure the test can be initiated in the same way, so I'm open to ideas here.

We can do a Containerfile with invalid syntax instead. It will just take a bit longer to produce the failure.

djoshy · 2024-12-13T20:57:32Z

test/e2e-ocl/onclusterlayering_test.go

@@ -572,7 +577,8 @@ func TestMCDGetsMachineOSConfigSecrets(t *testing.T) {
 	})

 	// Assign the secret name to the MachineOSConfig.
-	mosc.Spec.BuildOutputs.CurrentImagePullSecret.Name = secretName
+	// QOCL: This field doesn't exist anymore, is this test valid?
+	//mosc.Spec.CurrentImagePullSecret.Name = secretName


Another scenario where I believe the test currently depends on this field to run correctly.

This whole test can be dropped in favor of a test that just validates whether the global image pull secret is written to the nodes' filesystem.

I believe we already do this as part of the internal registry secret cert writer path, where the global pull secret and the internal registry secret are written to disk at /etc/mco/internal-registry-pull-secret.json: 2d77dcf

test/helpers/machineosconfigbuilder.go

djoshy · 2024-12-13T20:58:21Z

pkg/operator/render_test.go

@@ -306,6 +305,7 @@ func TestRenderAsset(t *testing.T) {
 			},
 		},
 		// Tests that the MCD DaemonSet gets MachineOSConfig secrets mounted into it.
+		/*QOCL: Another instance of what are we doing with CurrentImagePullSecret?


Another test that depends on CurrentImagePullSecret

To simplify things, we can probably remove this test while also removing the volume mounts from the MCD manifest, since they are no longer needed. We'd also need to modify the code within the MCD to write the global pull secret to the nodes' filesystem, although I think we're already doing that and just need to tell rpm-ostree to use that secret instead.

I believe you're talking about the internal registry path I mentioned here and I think we already have something that directs rpm-ostree to use the pull secret if that file exists. Correct me if I'm missing something here!

I've pulled out the code specific to the secret mounting and split it off into a separate commit. I've also pulled out all related tests for this particular path - the render unit test mentioned in this thread and the e2e test mentioned here #4756 (comment)

I do agree that it is a good idea to add a test to ensure that the global pull secret is written to disk. I think we might already have some unit tests for this path at the operator and daemon level, but not an e2e test. I could be wrong, though. Do you think we should add that work to this PR, or open a new card for it?

pkg/controller/build/buildrequest/machineosbuild.go

pkg/controller/build/reconciler.go

djoshy · 2024-12-18T17:19:29Z

/test unit

djoshy · 2024-12-18T17:43:09Z

/test unit
/test verify

djoshy · 2024-12-18T21:37:51Z

/test unit
/test verify

djoshy · 2024-12-20T19:24:29Z

pkg/controller/build/imagebuilder/base.go

+		Job: &mcfgv1.ObjectReference{
 			Name:      obj.GetName(),
-			Group:     obj.GroupVersionKind().Group,
+			Group:     "tbd-group",
 			Namespace: obj.GetNamespace(),
-			Resource:  obj.GetResourceVersion(),
+			Resource:  "tbd-resource",


These values now fail API validation. The API requires that all these fields should validate against format.dns1123Subdomain(). In my testing, Group was being assigned an empty value and Resource was being assigned ResourceVersion which is normally a number. Open to suggestions here!
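As a rough illustration of why the empty Group fails, here is a standard-library approximation of the DNS-1123 subdomain rule as Kubernetes documents it: at most 253 characters, only lowercase alphanumerics, `-`, or `.`, starting and ending with an alphanumeric. The function name `isDNS1123Subdomain` and the regular expression are part of this sketch; the real API validation may differ in detail.

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1123SubdomainMaxLength is the documented upper bound for a subdomain.
const dns1123SubdomainMaxLength = 253

// One or more dot-separated labels. Each label is lowercase alphanumeric,
// may contain '-' in the middle, and must start and end with an alphanumeric.
var dns1123SubdomainRegexp = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// isDNS1123Subdomain is a sketch of the syntactic check only. It is an
// approximation for illustration, not the validator the API server runs.
func isDNS1123Subdomain(s string) bool {
	return len(s) <= dns1123SubdomainMaxLength && dns1123SubdomainRegexp.MatchString(s)
}

func main() {
	for _, v := range []string{"", "tbd-group", "Jobs", "12345"} {
		fmt.Printf("%q valid=%t\n", v, isDNS1123Subdomain(v))
	}
}
```

In this sketch the empty string is rejected and the placeholder `tbd-group` is accepted. The check is purely syntactic: a numeric string such as `12345` also passes it, so this check alone says nothing about whether a value is a meaningful resource name.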

djoshy · 2024-12-20T19:28:22Z

pkg/operator/sync.go

+	// QOCL: Do we want to fatally exit here?
+	globalPullSecret, err := optr.ocSecretLister.Secrets(ctrlcommon.OpenshiftConfigNamespace).Get("pull-secret")
+	if err != nil {
+		return fmt.Errorf("error fetching cluster pull secret: %w", err)


Do we want to fatally exit here? If this secret is missing from the cluster, I'd think the build will certainly fail.

djoshy · 2024-12-20T19:29:14Z

pkg/operator/sync.go

+	// If it does exist, check if an update is required before making the update call.
+	if !reflect.DeepEqual(currentSecretCopy.Data, globalPullSecret.Data) {
+		klog.Infof("updating %s", ctrlcommon.GlobalPullSecretCopyName)
+		_, err := optr.kubeClient.CoreV1().Secrets(ctrlcommon.MCONamespace).Update(context.TODO(), globalPullSecretCopy, metav1.UpdateOptions{})


I'm not sure this update mechanism is necessary; I don't know whether the global secret will ever get updated after installation.
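For reference, the compare-before-update pattern in the hunk above can be sketched in isolation. The `secret` type and `needsUpdate` function below are hypothetical, simplified stand-ins (not `corev1.Secret` or MCO code); only `reflect.DeepEqual` is the same call the diff uses.

```go
package main

import (
	"fmt"
	"reflect"
)

// secret is a hypothetical, simplified stand-in for a Kubernetes Secret.
type secret struct {
	Name string
	Data map[string][]byte
}

// needsUpdate reports whether the mirrored copy's data has drifted from the
// source. Only Data is compared, so differences in names are ignored.
func needsUpdate(source, mirrored secret) bool {
	return !reflect.DeepEqual(source.Data, mirrored.Data)
}

func main() {
	source := secret{
		Name: "pull-secret",
		Data: map[string][]byte{".dockerconfigjson": []byte(`{"auths":{}}`)},
	}
	mirrored := secret{
		Name: "pull-secret-copy",
		Data: map[string][]byte{".dockerconfigjson": []byte(`{"auths":{}}`)},
	}
	fmt.Println("update needed:", needsUpdate(source, mirrored))

	// Simulate the source secret changing after the copy was made.
	source.Data[".dockerconfigjson"] = []byte(`{"auths":{"registry.example":{}}}`)
	fmt.Println("update needed:", needsUpdate(source, mirrored))
}
```

One caveat with this approach: `reflect.DeepEqual` treats a nil map and an empty non-nil map as different, so a comparison like this can report drift when both secrets are effectively empty.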

djoshy · 2024-12-20T19:30:40Z

pkg/controller/build/osbuildcontroller_test.go

-
+/*
+QOCL: This test seems to pass individually, but fails when run with other unit tests. Other tests in this package
+// seems to exhibit similar failures but at a lower rate. Possible object conflicts/worth exploring as a new bug?


These tests seem to conflict with other tests for some reason. I would appreciate some pointers if I'm missing something!

djoshy · 2024-12-20T19:34:52Z

pkg/operator/sync.go

+				OwnerReferences: []metav1.OwnerReference{
+					{
+						APIVersion: currentPool.APIVersion,
+						Kind:       currentPool.Kind,
+						Name:       currentPool.ObjectMeta.Name,
+						UID:        currentPool.ObjectMeta.UID,
+					},
+				},


This was a minor fix for a particular scenario I encountered during e2es: If the pool was deleted prior to being opted out of layering, this sync function wouldn't delete it. Adding the MCP as an owner field seemed like the cleanest fix.

djoshy · 2024-12-20T19:37:25Z

test/e2e-ocl/onclusterlayering_test.go

+// This test starts a build with an image that is known to fail because it uses
+// an invalid containerfile. After failure, it edits the  MachineOSConfig
+// with the expectation that the failed build and its  will be deleted and a new
+// build will start in its place.
 func TestGracefulBuildFailureRecovery(t *testing.T) {


This test has been updated to use a bad Containerfile, as suggested in #4756 (comment). It now takes about 17 minutes to run locally; I'm not sure how long it took before that change!

djoshy · 2024-12-20T19:40:09Z

test/e2e-ocl/onclusterlayering_test.go

 	imagestreamObjMeta := metav1.ObjectMeta{
-		Name:      "os-image",
-		Namespace: strings.ToLower(t.Name()),
+		Name: "os-image",
 	}


This was necessary because we are no longer able to inject a custom pull secret. By placing the image in the MCO's slice of the registry, the daemon will be able to handle this update with existing secrets.

djoshy · 2024-12-20T20:22:25Z

This PR has been rebased on the latest forks of the API and client-go from Jerry, so it is fairly close to the shape of the final API. I've left comments above so hopefully we can remember where we left off before the break 🎄

Since we can't run the e2es until the v1 CRDs have landed, I took the liberty of running them locally, and the only failure seemed to be TestYumReposBuilds. I am unsure whether running in CI will yield better results, but I thought I'd mention it in case something I did here broke them! 😄

djoshy · 2024-12-20T20:33:41Z

/test unit
/test verify

openshift-ci · 2024-12-20T20:45:06Z

@djoshy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/verify | `b379724` | link | true | `/test verify` |

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Dec 13, 2024

openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Dec 13, 2024

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Dec 13, 2024