Validate ACR token validity after token password rotation #3059

Open
ventifus wants to merge 4 commits into base: master from validate-acr-token

Conversation

ventifus
Collaborator

Which issue this PR addresses:

Fixes ARO-3018

What this PR does / why we need it:

When we rotate ACR tokens via PUCM, it is possible for the token to end up in an ambiguous state, possibly due to a race condition. See the Slack thread for more context: https://redhat-internal.slack.com/archives/C02ULBRS68M/p1689709898669219

Test plan for issue:

TODO

Is there any documentation that needs to be updated for this PR?

TODO

pkg/cluster/acrtoken.go (fixed, resolved)
@cadenmarchese cadenmarchese added the chainsaw Pull requests or issues owned by Team Chainsaw label Jul 25, 2023
@ventifus ventifus force-pushed the validate-acr-token branch 3 times, most recently from cf58141 to 341d000 Compare July 25, 2023 22:28
@azure-pipelines

No commit pushedDate could be found for PR 3059 in repo Azure/ARO-RP

Collaborator

@dem4gus dem4gus left a comment

I understand the desire to separate the token validation from the pull-secret controller's reconciliation process, which can possibly error for unrelated reasons, but we will need to keep in mind that what we're validating here isn't actually the thing the cluster is using to authenticate to the container registry. Ideally it eventually becomes that (or part of it), but there is much more that goes into the pull secret during the controller's reconciliation that could affect the final pull secret. Getting it from here is probably a better option though, because there's no chance of accessing customer secrets that could also be on the pull-secret object.

We will also want to keep in mind that we're never revalidating the token, which may become a problem if (or when) password expirations are introduced to the process. However, that is something we can revisit when the time comes.

tokens: containerregistry.NewTokensClient(env.Environment(), r.SubscriptionID, localFPAuthorizer),
registries: containerregistry.NewRegistriesClient(env.Environment(), r.SubscriptionID, localFPAuthorizer),
newAzAcrClient: azcontainerregistry.NewClient,
getTokenCredential: clusterauthorizer.GetTokenCredential,
Collaborator

From what I can tell, this clusterauthorizer.GetTokenCredential function is for getting an Azure OAuth token, which is not how the clusters authenticate with the container registry. The token that is being updated here and used in the pull secret is a Basic Auth token in .dockerconfigjson format, and the credentials live in the cluster itself.

func (m *manager) ValidateToken(ctx context.Context, rp *api.RegistryProfile) error {
creds := clusterauthorizer.Credentials{
ClientID: []byte(rp.Username),
ClientSecret: []byte(rp.Password),
Collaborator

It looks like you're using the registry profile that's stored on the cluster doc in CosmosDB to get the token password. I chose not to even worry about updating that password during the rotation process because the only time the cluster cares about that password is during the initial cluster installation: that's how the RP passes the username (token name) and password to the cluster during the bootstrap process.

The cluster's "source of truth" for its current ACR password is the cluster secret in openshift-azure-operator. That's what it uses during the pullsecret controller's reconciliation to ultimately construct the pull secret that nodes use to get container images. The cluster never accesses the password on the cluster document after installation, so I opted not to as well in order to retain a single source of truth for the password.

As far as testing the validity of the new credentials, it would make sense to me to use the exact same credentials the cluster uses in order to minimize potential differences in data. That means getting the secret from the cluster instead of the RP, and deserializing the token stored on it.
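As a rough illustration of the "deserializing the token" step, here is a minimal sketch that decodes a `.dockerconfigjson` payload into per-registry usernames and passwords. It assumes the standard layout (`auths` keyed by registry host, each with a base64 `auth` of `username:password`). The names `parseDockerConfig` and `registryCredential` are hypothetical and are not taken from this PR; fetching the secret from the cluster is left out.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// dockerConfigJSON mirrors the standard .dockerconfigjson layout:
// {"auths": {"<registry host>": {"auth": "<base64(username:password)>"}}}
type dockerConfigJSON struct {
	Auths map[string]struct {
		Auth string `json:"auth"`
	} `json:"auths"`
}

// registryCredential is a hypothetical helper type for this sketch.
type registryCredential struct {
	Username string
	Password string
}

// parseDockerConfig decodes the raw secret payload and returns the
// Basic Auth credentials keyed by registry host.
func parseDockerConfig(raw []byte) (map[string]registryCredential, error) {
	var cfg dockerConfigJSON
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("unmarshal pull secret: %w", err)
	}

	creds := make(map[string]registryCredential, len(cfg.Auths))
	for registry, entry := range cfg.Auths {
		decoded, err := base64.StdEncoding.DecodeString(entry.Auth)
		if err != nil {
			return nil, fmt.Errorf("decode auth for %s: %w", registry, err)
		}
		// Split on the first colon only; the password may contain colons.
		username, password, ok := strings.Cut(string(decoded), ":")
		if !ok {
			return nil, fmt.Errorf("credentials format error: %s", registry)
		}
		creds[registry] = registryCredential{Username: username, Password: password}
	}
	return creds, nil
}

func main() {
	// Example values only; "example.azurecr.io" is a placeholder registry.
	auth := base64.StdEncoding.EncodeToString([]byte("token-name:s3cret"))
	raw := []byte(`{"auths":{"example.azurecr.io":{"auth":"` + auth + `"}}}`)

	creds, err := parseDockerConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(creds["example.azurecr.io"].Username)
}
```

Because the map is keyed by registry host, the same parse also yields the registry address to test against, without consulting any other source.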

return err
}

client, err := m.newAzAcrClient(m.env, fmt.Sprintf("https://%s", m.env.ACRDomain()), token, nil)
Collaborator

The container registry's URL is stored as part of the .dockerconfigjson string. You should be able to use that to get a connection to the ACR without having to use the environment interface at all.

@github-actions github-actions bot added the needs-rebase branch needs a rebase label Aug 22, 2023
@github-actions

Please rebase pull request.

Collaborator

@cadenmarchese cadenmarchese left a comment

I know this is still in draft, but I thought I'd leave some thoughts on the logic. Thanks for your efforts so far!

pkg/cluster/acrtoken.go (outdated, resolved)
pkg/cluster/acrtoken.go (outdated, resolved)
pkg/cluster/acrtoken.go (outdated, resolved)
pkg/cluster/acrtoken.go (resolved)
@ventifus ventifus force-pushed the validate-acr-token branch 6 times, most recently from 642f582 to 81064e4 Compare October 24, 2023 21:53
Collaborator

@cadenmarchese cadenmarchese left a comment

The ValidatePullSecret logic looks good to me! I think the only thing to iron out is when exactly to run it. Right now, we're running it on PUCM as well as in the operator, which I think I agree with because we'd get both an operator status and a PUCM failure, but I can see why we'd only really need one or the other since we're always doing the rotation.

pkg/cluster/adminupdate_test.go (outdated, resolved)
pkg/util/pullsecret/pullsecret.go (outdated, resolved)
Collaborator

@cadenmarchese cadenmarchese left a comment

Few small comments, but for the most part, looking good!

pkg/cluster/acrtoken.go (resolved)
@@ -130,7 +130,7 @@ func TestAdminUpdateSteps(t *testing.T) {
"[Action fixMCSCert-fm]",
"[Action fixMCSUserData-fm]",
"[Action ensureGatewayUpgrade-fm]",
- "[Action rotateACRTokenPassword-fm]",
+ "[Condition rotateAndValidateACRTokenPassword-fm, timeout 5m0s]",
Collaborator

Since we aren't waiting on a reconciliation here, do you think we should just make this an Action with no timeout instead of leaving it as Condition?

Comment on lines 200 to 202
if err != nil {
return err
}
Collaborator

Redundant? It looks like we already returned any possible error on line 198.

Comment on lines 197 to 198
err = fmt.Errorf("credentials format error: %s", registry)
return err
Collaborator

Nit - might as well consolidate these lines since you did it on line 205:

Suggested change
- err = fmt.Errorf("credentials format error: %s", registry)
- return err
+ return fmt.Errorf("credentials format error: %s", registry)

Collaborator

+1

Collaborator

@dem4gus dem4gus left a comment

Some questions about which tokens we're validating in the pull secret.

if err != nil {
return err
}
for registry, authBase64 := range dockerConfig {
Collaborator

This will check every registry in openshift-config/pull-secret, including user-defined credentials. Are we okay with doing that? And if we are, errors returned from this loop will short-circuit the validation. For example, if there is a pull-secret with both a user-defined registry and the ARO ACR defined and the user secret is checked first and has some sort of error, this will early return on that error and the ARO ACR token will never be validated. I don't know how we feel about using goroutines to check credentials concurrently, but since one credential's validity shouldn't have a bearing on another cred that may be a solution. Alternatively, we could extract the ARO ACR token and only check that one.
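One way to avoid the short-circuit without goroutines is to keep iterating and collect the failures, for example with `errors.Join` (Go 1.20+). This is only a sketch: `validateAll`, `validateFunc`, and `errInvalid` are hypothetical names, and the per-credential check is passed in as a function because the real validation call is not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalid is a hypothetical sentinel error used only in this sketch.
var errInvalid = errors.New("invalid credential")

// validateFunc is a hypothetical stand-in for the call that checks one credential.
type validateFunc func(registry, authBase64 string) error

// validateAll checks every entry and collects failures instead of
// returning on the first one, so one bad credential cannot hide another.
func validateAll(dockerConfig map[string]string, validate validateFunc) error {
	var errs []error
	for registry, authBase64 := range dockerConfig {
		if err := validate(registry, authBase64); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", registry, err))
		}
	}
	// errors.Join returns nil when there is nothing to join.
	return errors.Join(errs...)
}

func main() {
	// Placeholder registries and auth values, for illustration only.
	dockerConfig := map[string]string{
		"user.example.com":   "bad",
		"example.azurecr.io": "good",
	}

	checked := 0
	err := validateAll(dockerConfig, func(registry, authBase64 string) error {
		checked++
		if authBase64 == "bad" {
			return errInvalid
		}
		return nil
	})
	fmt.Println("checked:", checked, "error:", err)
}
```

Both entries are visited regardless of which one the map yields first, and the joined error still satisfies `errors.Is(err, errInvalid)`.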

Contributor

I'm +1 to extracting the ARO ACR token for validation and disregarding the other tokens.
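A minimal sketch of that option, assuming `dockerConfig` is a map keyed by registry host as the loop above suggests. `validateOnly`, `validateFunc`, and `errInvalid` are hypothetical names; the registry host would come from wherever the ARO ACR domain is already known.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalid is a hypothetical sentinel error used only in this sketch.
var errInvalid = errors.New("invalid credential")

// validateFunc is a hypothetical stand-in for the call that checks one credential.
type validateFunc func(registry, authBase64 string) error

// validateOnly looks up a single registry in the pull secret and validates
// just that entry, ignoring any user-defined credentials next to it.
func validateOnly(dockerConfig map[string]string, registry string, validate validateFunc) error {
	authBase64, ok := dockerConfig[registry]
	if !ok {
		return fmt.Errorf("no credentials found for registry %s", registry)
	}
	return validate(registry, authBase64)
}

func main() {
	// Placeholder registries and auth values, for illustration only.
	dockerConfig := map[string]string{
		"user.example.com":   "bad",
		"example.azurecr.io": "good",
	}

	err := validateOnly(dockerConfig, "example.azurecr.io", func(registry, authBase64 string) error {
		if authBase64 == "bad" {
			return errInvalid
		}
		return nil
	})
	fmt.Println("error:", err)
}
```

With a direct lookup, a broken user-defined entry is never touched, so it can neither fail the step nor be read by the validation code.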

wantAuth map[string]string
wantErr string
client RegistryClient
}{
Collaborator

If we're going to validate all tokens in the secret, then we'll need a test case for a valid ACR token being checked after an invalid user token. I don't remember if Go slices are guaranteed to be handled in order in a for loop, but it would be good to at least attempt it.
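On the ordering question: in Go, `range` over a slice visits elements in index order, but `range` over a map has no specified order and can differ between runs. The loop under review uses a two-value range with a registry key, which suggests `dockerConfig` is a map (an assumption, since its type is not shown here). A test that needs "invalid user token first, valid ACR token second" could iterate over sorted keys, sketched below with a hypothetical `sortedRegistries` helper.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedRegistries returns the registry hosts in lexical order so that
// iteration is deterministic, unlike ranging over the map directly.
func sortedRegistries(dockerConfig map[string]string) []string {
	registries := make([]string, 0, len(dockerConfig))
	for registry := range dockerConfig {
		registries = append(registries, registry)
	}
	sort.Strings(registries)
	return registries
}

func main() {
	// Placeholder entries: the key names are chosen so the invalid
	// user-defined registry sorts before the ACR entry.
	dockerConfig := map[string]string{
		"example.azurecr.io":          "valid-acr-token",
		"a-user-registry.example.com": "invalid-user-token",
	}

	for _, registry := range sortedRegistries(dockerConfig) {
		fmt.Println(registry, dockerConfig[registry])
	}
}
```

Sorting only makes the test deterministic; it does not by itself stop an early return on the first failing entry.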

@@ -102,7 +102,7 @@ func (m *manager) adminUpdate() []steps.Step {
if isEverything {
toRun = append(toRun,
steps.Action(m.ensureGatewayUpgrade),
- steps.Action(m.rotateACRTokenPassword),
+ steps.Condition(m.rotateAndValidateACRTokenPassword, 5*time.Minute, true),
Contributor

Don't we need to ensure the operator is running before moving forward with the rotation? You can make sure the operator is ready using the aroDeploymentReady Condition function.

@@ -195,7 +195,7 @@ func (m *manager) Update(ctx context.Context) error {
steps.Action(m.createOrUpdateDenyAssignment),
steps.Action(m.startVMs),
steps.Condition(m.apiServersReady, 30*time.Minute, true),
- steps.Action(m.rotateACRTokenPassword),
+ steps.Condition(m.rotateAndValidateACRTokenPassword, 5*time.Minute, true),
Contributor

Same question/comment as in adminUpdate.


github-actions bot commented Jan 4, 2024

Please rebase pull request.

@github-actions github-actions bot added needs-rebase branch needs a rebase and removed ready-for-review labels Jan 4, 2024
@ventifus ventifus force-pushed the validate-acr-token branch 3 times, most recently from cb8ec65 to 55611b5 Compare February 8, 2024 00:34
@github-actions github-actions bot removed the needs-rebase branch needs a rebase label Feb 8, 2024
@github-actions github-actions bot added the needs-rebase branch needs a rebase label May 7, 2024

github-actions bot commented May 7, 2024

Please rebase pull request.

@mociarain
Collaborator

What's the current state of this and can I help get it moving again?

Labels
chainsaw Pull requests or issues owned by Team Chainsaw needs-rebase branch needs a rebase
5 participants