Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ARO operator Azure auth scheme to use a DefaultAzureCredential #3274

Merged

Conversation

kimorris27
Copy link
Contributor

@kimorris27 kimorris27 commented Nov 8, 2023

Which issue this PR addresses:

What this PR does / why we need it:

This PR updates the ARO operator to authenticate to Azure using a DefaultAzureCredential, which will work with both a CSP and WI.

To get it working with a CSP, I had to make a few other changes:

  • Have the ARO operator get the Azure authentication details from Secret values that are put into the Pod via environment variables
  • Add a CredentialsRequest to the ARO operator deployment, which tells the CCO (cloud credential operator) to create an Azure creds Secret in the ARO operator's namespace
  • Restart the ARO operator (master only) when customers rotate their CSP credentials (to ensure the new Secret value is read in). The az aro update process will also now verify that the CredentialsRequest mentioned in the bullet above has been reconciled (and thus the Secret has been updated) before doing this restart.

Note that to get the ARO operator working on an ARO WI cluster later on, we'll need a follow-up PR to have the operator Pod 1) use AZURE_FEDERATED_TOKEN_FILE instead of AZURE_CLIENT_SECRET and 2) mount a bound service account token (more info in testing steps below because I had to do it manually; the testing steps could also be a useful reference for the engineer who makes the follow-up PR).

Test plan for issue:

There were a few different things to test:

  • The new authentication scheme works with a CSP cluster
    • Tested by provisioning a dev cluster and observing the operator successfully doing stuff to Azure resources in response to changes I was making on the cluster
    • Will also be covered in E2E
  • The operator still authenticates to Azure successfully and continues to work after rotating the CSP's client secret via an az aro update --refresh-credentials
    1. I ran an az aro update --refresh-credentials against my dev cluster and observed the ARO operator master Pod get replaced during the Deployment rollout.
    2. I then used the Portal to remove the Microsoft.Storage service endpoint from both cluster subnets.
    3. Then I scaled down one of the worker MachineSets to trigger the subnets controller. The subnets controller added the service endpoint back successfully.
    • I observed some 401 Azure authentication errors immediately after the new operator master Pod started up. It seems like there is short lag before the correct Secret value makes it into the Pod, but it eventually worked with no intervention (per my test results). Just mentioning here because it was interesting to me.
  • The new authentication scheme works with a WI (tested on a vanilla OCP WI cluster)
    1. I provisioned a WI cluster using the instructions at [1], but before running ccoctl azure create-all, I copied the ARO operator's CredentialsRequest manifest into the --credentials-request-dir so that ccoctl would create a WI and a Secret manifest for the ARO operator. I also removed this Secret manifest from the install manifest directory before installing the cluster and saved the manifest so I could apply it myself later.
    2. After the cluster finished installing, I added the ARO Cluster CRD to the cluster and created a minimal Cluster CR with just the attributes needed to make the subnets controller work. I included these attributes:
      • azEnvironment
      • clusterResourceGroupId
      • location
      • operatorFlags (disabled everything except the subnets controller and service endpoint reconciliation)
      • resourceId (just put any resource ID that has the right subscription; the subnet controller only reads this to get the subscription ID)
      • serviceSubnets:
      • storageSuffix
      • vnetId
    3. I then set up the ARO operator (master only, since only the master Pod makes Azure API calls) on the cluster using the manifests in pkg/operator/deploy and the Secret manifest that I had saved during step 1. I did make a few changes to the Deployment manifest: 1) replaced AZURE_CLIENT_SECRET with AZURE_FEDERATED_TOKEN_FILE and 2) mounted a bound service account token (see [2] for the things to add).

[1] https://github.com/openshift/cloud-credential-operator/blob/master/docs/azure_workload_identity.md

[2] Two things to add in the ARO operator Deployment spec to mount the bound service account token:

  1. Add this to the Pod spec:
volumes:
- name: bound-sa-token
  projected:
  defaultMode: 420
  sources:
  - serviceAccountToken:
      audience: openshift
      expirationSeconds: 3600
      path: token
  1. Add this to the container spec:
volumeMounts:
- mountPath: /var/run/secrets/openshift/serviceaccount
  name: bound-sa-token
  readOnly: true

Is there any documentation that needs to be updated for this PR?

No

Copy link
Collaborator

@cadenmarchese cadenmarchese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, looks great! I just have a question about the continued use of an old function.

pkg/cluster/install.go Outdated Show resolved Hide resolved
pkg/util/clusterauthorizer/authorizer.go Outdated Show resolved Hide resolved
pkg/util/clusterauthorizer/authorizer.go Outdated Show resolved Hide resolved
@kimorris27 kimorris27 force-pushed the aro-operator-auth-update-for-wimi branch from f9405b6 to ad91451 Compare November 13, 2023 16:29
Copy link
Collaborator

@bennerv bennerv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work - just a quick first pass on this

pkg/util/kubernetes/deployments.go Outdated Show resolved Hide resolved
d.Spec.Template.ObjectMeta.Annotations["kubectl.kubernetes.io/restartedAt"] = time.Now().Format(time.RFC3339)

_, err = cli.Update(ctx, d, metav1.UpdateOptions{})

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - no new line

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we patch this? We've been having issues with running updates instead of patches.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to making this a patch call, especially considering we just want to ensure an annotation is added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I became aware while changing this over to a patch call that there are multiple ways to patch. After looking at the different options, I went with server-side apply. Let me know if we should do it a different way though.

pkg/operator/deploy/deploy.go Outdated Show resolved Hide resolved
@@ -201,6 +201,8 @@ func (m *manager) Update(ctx context.Context) error {
steps.Action(m.renewMDSDCertificate),
steps.Action(m.updateOpenShiftSecret),
steps.Action(m.updateAROSecret),
steps.Action(m.restartAROOperatorMaster), // depends on m.updateOpenShiftSecret; the point of restarting is to pick up any changes made to the secret
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to wait for the reconciliation of the credentialsRequest before we restart it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I think we want to, but I'll have to think about how to check that the reconciliation is done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did somewhat of a deep dive while figuring out how to go about this. The CredentialsRequest's status has a lastSyncCloudCredsSecretResourceVersion that we could check, but it feels hacky because we would have to save the old version prior to updating the secret so that we can see what it used to be. I'm thinking we can just use the lastSyncTimestamp instead by checking to see if it's something within the past 5 minutes. Your thoughts?

It's also tricky because CredentialsRequests haven't made their way into openshift/client-go. Based on my reading, I'll have to use https://pkg.go.dev/k8s.io/client-go/dynamic#DynamicClient to work with the CredentialsRequest from within pkg/cluster, but let me know if there's a better way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a solution that uses the lastSyncTimestamp, and I feel like this should be fine.

@@ -22,18 +22,14 @@ type servicePrincipalChecker interface {
type checker struct {
log *logrus.Entry

credentials func(ctx context.Context) (*clusterauthorizer.Credentials, error)
getTokenCredential func(azEnv *azureclient.AROEnvironment, credentials *clusterauthorizer.Credentials) (azcore.TokenCredential, error)
getTokenCredential func(azEnv *azureclient.AROEnvironment) (azcore.TokenCredential, error)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the purpose of this as a struct? So that the controller can fetch a new token credential every time it needs to run the checker, rather than once on startup?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also wondering if this is "legacy". The newer azure-sdk-for-go no longer requires us to refresh the token iirc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the purpose of this as a struct? So that the controller can fetch a new token credential every time it needs to run the checker, rather than once on startup?

I'm not 100% sure, but it seems that way to me.

I'm also wondering if this is "legacy". The newer azure-sdk-for-go no longer requires us to refresh the token iirc.

I wasn't aware of this. I'll have to look into it a bit, and then maybe we can remove unneeded stuff from the struct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some digging, I found that you're right - as long as we're using azidentity, our tokens will get auto-refreshed before they expire. So I'll see about refactoring this struct a bit.

For context, #2667 switched us over to azidentity, which uses MSAL under the hood, and https://learn.microsoft.com/en-us/entra/identity-platform/msal-overview states that MSAL refreshes tokens automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, after looking at this code in more detail and considering how we might refactor it, I think we might as well leave it how it is.

It looks like for all of the checkers under pkg/operator/controllers/checkers, there's a pattern of defining a SomethingChecker interface and then creating a struct type that implements the interface. So it logically makes sense to have that struct type encapsulate anything related to the Check that the controller in question does. I think getTokenCredential and newSPValidator are included in the struct more because its logical than because of a goal of getting a new token each time the controller reconciles (which is what I was originally thinking).

Additionally, if you look at the storage accounts controller for example (specifically

authorizer, err := azRefreshAuthorizer.NewRefreshableAuthorizerToken(ctx)
), it also "manually" creates a new token credential each time it Reconciles. So if we wanted to refactor to only create a token credential on controller startup, we'd probably want to do that to all of the controllers that leverage Azure clients and not just the service principal checker.

Let me know what you think!

Copy link
Contributor

@s-amann s-amann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first pass lgtm, some comments below. If you haven't already done so, please rebase.

pkg/operator/deploy/deploy.go Outdated Show resolved Hide resolved
d.Spec.Template.ObjectMeta.Annotations["kubectl.kubernetes.io/restartedAt"] = time.Now().Format(time.RFC3339)

_, err = cli.Update(ctx, d, metav1.UpdateOptions{})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to making this a patch call, especially considering we just want to ensure an annotation is added.

Copy link
Collaborator

@s-fairchild s-fairchild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, other than some additional test coverage if possible. If that's too much for this PR then maybe in a follow up PR.
Thanks for the neat commit structure.

pkg/util/kubernetes/deployments.go Outdated Show resolved Hide resolved
pkg/cluster/arooperator.go Show resolved Hide resolved
@cadenmarchese cadenmarchese added the next-release To be included in the next RP release rollout label Nov 17, 2023
@kimorris27 kimorris27 force-pushed the aro-operator-auth-update-for-wimi branch from ad91451 to 474d638 Compare November 17, 2023 21:45
- Removed a test from pkg/util/clusterauthorizer; there's no longer
  anything to test since we don't have direct access to the Azure
  credential data when using a DefaultAzure Credential
- Fixed misc. unit tests to reflect changes to GetTokenCredential in
  pkg/util/clusterauthorizer
- Simplified checking logic in certain places
- Renamed an import to make the linter happy
- Removed now unused AzCredentials function
- Changed ARO operator deployment wait time during `az aro update` from
  20 minutes -> 5 minutes
- Updated Restart in pkg/util/kubernetes to use server-side apply
- Updated Restart in pkg/operator/deploy to only return an error after
  at least attempting to restart all of the deployments passed in
- Unit tests for Restart in pkg/util/kubernetes
- E2E test for ARO operator master deployment's restart upon cluster update
@kimorris27 kimorris27 force-pushed the aro-operator-auth-update-for-wimi branch from cc35b92 to a5f5076 Compare November 22, 2023 01:20
s-fairchild
s-fairchild previously approved these changes Nov 22, 2023
Copy link
Collaborator

@s-fairchild s-fairchild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding test coverage, lgtm.

test/e2e/update.go Show resolved Hide resolved
// ServicePrincipalProfile to the contents of the kube-system/azure-cloud-provider Secret. If the CSP
// has changed, it returns true along with a new corev1.Secret to use to update the Secret to match
// what's in the cluster doc.
func (m *manager) servicePrincipalUpdated(ctx context.Context) (bool, *corev1.Secret, error) {
var changed bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if possible I'd rather see the code that is in the if changed block extracted to its own helper function and used along with a return when we need to set changed=true. this avoids the use of a tracking boolean and reduces complexity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I see what you mean. I've pushed a commit that I believe implements this the way you have in mind.

@kimorris27
Copy link
Contributor Author

/azp run ci

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Collaborator

@s-fairchild s-fairchild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@cadenmarchese cadenmarchese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested both a new install and PUCM with these changes to confirm behavior looks correct - lgtm, thanks for addressing all of the changes!

@cadenmarchese cadenmarchese merged commit 9a9edac into Azure:master Nov 28, 2023
18 checks passed
ventifus pushed a commit to ventifus/ARO-RP that referenced this pull request Feb 7, 2024
…Azure#3274)

* Update the cluster authorizer to use a DefaultAzureCredential

* Update the ARO operator to set and use DefaultAzureCredential via env vars

* Add a CredentialsRequest to the ARO operator deployment

* Restart the ARO operator upon `az aro update`

* Removed now unused AzCredentials function

* Changed ARO operator deployment wait time during `az aro update` from
  20 minutes -> 5 minutes

* Refactor CliWithApply to generalize to different object types

* Updated Restart in pkg/util/kubernetes to use server-side apply
* Updated Restart in pkg/operator/deploy to only return an error after
  at least attempting to restart all of the deployments passed in

* E2E test for ARO operator master deployment's restart upon cluster update

* Wait for the ARO operator's CredentialsRequest to be reconciled before
restarting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chainsaw Pull requests or issues owned by Team Chainsaw next-release To be included in the next RP release rollout ready-for-review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants