Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DNM] MCO-1437: MCO-1476: MCO-1477: MCO-1284: Adapt MCO to OCL v1 API #4756

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

djoshy
Copy link
Contributor

@djoshy djoshy commented Dec 13, 2024

[DNM] Testing current state of v1 API

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 13, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 13, 2024
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Dec 13, 2024

@djoshy: This pull request references MCO-1437 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

[DNM] Testing current state of v1 API

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

openshift-ci bot commented Dec 13, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

openshift-ci bot commented Dec 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 13, 2024
@@ -81,6 +80,7 @@ func TestMachineOSBuild(t *testing.T) {
OSImageURLConfig: fixtures.OSImageURLConfig(),
},
},
/* QOCL: Are these test cases valid anymore?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 6 test cases here that seem to exercise fields that have been removed: BaseOSExtensionsImagePullspe, BaseOSImagePullspec and ReleaseVersion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, we still need those fields in order to produce a hashed name for a MachineOSBuild. The difference is that we only have a single canonical source for these values with the new API changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood and I guess my question here is that, since these fields no longer exist and can cause a "different" hash for a build, what these tests are trying to verify by manipulating these fields and examining the result hash can't be done anymore?

ctx, cancel := context.WithCancel(context.Background())
t.Cleanup(cancel)
/*
QOCL: Need a new failure mode for this as BaseImagePullspec field doesn't exist anymore
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the removal of BaseImagePullSpec, I'm not sure the test can be initiated in the same way here, so I'm open to ideas here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do a Containerfile with invalid syntax instead. It will just take a bit longer to produce the failure.

@@ -572,7 +577,8 @@ func TestMCDGetsMachineOSConfigSecrets(t *testing.T) {
})

// Assign the secret name to the MachineOSConfig.
mosc.Spec.BuildOutputs.CurrentImagePullSecret.Name = secretName
// QOCL: This field doesn't exist anymore, is this test valid?
//mosc.Spec.CurrentImagePullSecret.Name = secretName
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another scenario where I believe the test currently depends on this field to run correctly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole test can be dropped in favor of a test that just validates whether the global image pull secret is written to the nodes' filesystem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we already do this as part of the internal registry secret cert writer path, where the global pull secret + internal registry secret is written to disk at /etc/mco/internal-registry-pull-secret.json : 2d77dcf

@@ -306,6 +305,7 @@ func TestRenderAsset(t *testing.T) {
},
},
// Tests that the MCD DaemonSet gets MachineOSConfig secrets mounted into it.
/*QOCL: Another instance of what are we doing with CurrentImagePullSecret?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another test that depends on CurrentImagePullSecret

Copy link
Member

@cheesesashimi cheesesashimi Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify things, we can probably remove this test while also removing the volume mounts from the MCD manifest since this is no longer needed. We'd also need to modify the code within the MCD to write the global pull secret to the nodes filesystem, although I think we're already doing that and just need to tell rpm-ostree to use that secret instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you're talking about the internal registry path I mentioned here and I think we already have something that directs rpm-ostree to use the pull secret if that file exists. Correct me if I'm missing something here!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pulled out the code specific to the secret mounting and split it off into a separate commit. I've also pulled out all related tests for this particular path - the render unit test mentioned in this thread and the e2e test mentioned here #4756 (comment)

I do agree that it is a good idea to add a test to ensure that global pull secret is written to disk. I think we might have some units for this path on the operator and the daemon level already, but not an e2e - I could be wrong though. Do you think we should add that work to this PR, or open up a new card for it?

@djoshy
Copy link
Contributor Author

djoshy commented Dec 18, 2024

/test unit

@djoshy
Copy link
Contributor Author

djoshy commented Dec 18, 2024

/test unit
/test verify

@djoshy
Copy link
Contributor Author

djoshy commented Dec 18, 2024

/test unit
/test verify

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 20, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 20, 2024
Comment on lines +305 to +309
Job: &mcfgv1.ObjectReference{
Name: obj.GetName(),
Group: obj.GroupVersionKind().Group,
Group: "tbd-group",
Namespace: obj.GetNamespace(),
Resource: obj.GetResourceVersion(),
Resource: "tbd-resource",
Copy link
Contributor Author

@djoshy djoshy Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values now fail API validation. The API requires that all these fields should validate against format.dns1123Subdomain(). In my testing, Group was being assigned an empty value and Resource was being assigned ResourceVersion which is normally a number. Open to suggestions here!

// QOCL: Do we want to fatally exit here?
globalPullSecret, err := optr.ocSecretLister.Secrets(ctrlcommon.OpenshiftConfigNamespace).Get("pull-secret")
if err != nil {
return fmt.Errorf("error fetching cluster pull secret: %w", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to fatally exit here? If this secret is missing in cluster, the build will certainly fail, I'd think.

// If it does exist, check if an update is required before making the update call.
if !reflect.DeepEqual(currentSecretCopy.Data, globalPullSecret.Data) {
klog.Infof("updating %s", ctrlcommon.GlobalPullSecretCopyName)
_, err := optr.kubeClient.CoreV1().Secrets(ctrlcommon.MCONamespace).Update(context.TODO(), globalPullSecretCopy, metav1.UpdateOptions{})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this update mechanism is necessary, I don't know if the global secret will ever get updated after installation?


/*
QOCL: This test seems to pass individually, but fails when run with other unit tests. Other tests in this package
// seems to exhibit similar failures but at a lower rate. Possible object conflicts/worth exploring as a new bug?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests seemed to be conflicting with other tests for some reason, would appreciate some pointers if I'm missing something!

Comment on lines +1431 to +1438
OwnerReferences: []metav1.OwnerReference{
{
APIVersion: currentPool.APIVersion,
Kind: currentPool.Kind,
Name: currentPool.ObjectMeta.Name,
UID: currentPool.ObjectMeta.UID,
},
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a minor fix for a particular scenario I encountered during e2es: If the pool was deleted prior to being opted out of layering, this sync function wouldn't delete it. Adding the MCP as an owner field seemed like the cleanest fix.

// This test starts a build with an image that is known to fail because it uses
// an invalid containerfile. After failure, it edits the MachineOSConfig
// with the expectation that the failed build and its will be deleted and a new
// build will start in its place.
func TestGracefulBuildFailureRecovery(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test has been updated to use a bad container file as suggested in #4756 (comment). It now takes about 17 minutes to run locally, not sure how fast it used to be prior to that!

Comment on lines 805 to 807
imagestreamObjMeta := metav1.ObjectMeta{
Name: "os-image",
Namespace: strings.ToLower(t.Name()),
Name: "os-image",
}
Copy link
Contributor Author

@djoshy djoshy Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was necessary because we are no longer able to inject a custom pull secret. By placing the image in the MCO's slice of the registry, the daemon will be able to handle this update with existing secrets.

@djoshy
Copy link
Contributor Author

djoshy commented Dec 20, 2024

This PR has been rebased on the latest forks of the API and client-go from Jerry, so it is fairly close to the shape of the final API. I've left comments above so hopefully we can remember where we left off before the break 🎄

Since we can't run the e2es until the v1 CRDs have landed, I took the liberty of testing the e2es locally and we seemed to be only failing on TestYumReposBuilds. I am unsure if running in CI will fare better results, but just thought I'd mention it in case something I did here did break them! 😄

@djoshy
Copy link
Contributor Author

djoshy commented Dec 20, 2024

/test unit
/test verify

Copy link
Contributor

openshift-ci bot commented Dec 20, 2024

@djoshy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/verify b379724 link true /test verify

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants