-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate ACR token validity after token password rotation #3059
base: master
Are you sure you want to change the base?
Conversation
cf58141
to
341d000
Compare
No commit pushedDate could be found for PR 3059 in repo Azure/ARO-RP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the desire to separate the token validation from the pull-secret controller's reconciliation process, which can possibly error for unrelated reasons, but we will need to keep in mind that what we're validating here isn't actually the thing the cluster is using to authenticate to the container registry. Ideally it eventually becomes that (or part of it), but there is much more that goes into the pull secret during the controller's reconciliation that could affect the final pull secret. Getting it from here is probably a better option though, because there's no chance of accessing customer secrets that could also be on the pull-secret object.
We will also want to keep in mind that we're never revalidating the token, which may become a problem if (or when) password expirations are introduced to the process. However, that is something we can revisit when the time comes.
pkg/util/acrtoken/acrtoken.go
Outdated
tokens: containerregistry.NewTokensClient(env.Environment(), r.SubscriptionID, localFPAuthorizer), | ||
registries: containerregistry.NewRegistriesClient(env.Environment(), r.SubscriptionID, localFPAuthorizer), | ||
newAzAcrClient: azcontainerregistry.NewClient, | ||
getTokenCredential: clusterauthorizer.GetTokenCredential, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell this clusterauthorizer.GetTokenCredential
function is for getting an Azure OAuth token, which is not how the clusters authenticate with the container registry. The tokens that is being updated here and that is used in the pull secret is a Basic Auth token in .dockerconfigjson format, and the credentials live in the cluster itself.
pkg/util/acrtoken/acrtoken.go
Outdated
func (m *manager) ValidateToken(ctx context.Context, rp *api.RegistryProfile) error { | ||
creds := clusterauthorizer.Credentials{ | ||
ClientID: []byte(rp.Username), | ||
ClientSecret: []byte(rp.Password), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you're using the registry profile that's stored on the cluster doc in CosmosDB to get the token password. I chose not to even worry about updating that password during the rotation process because the only time the cluster cares about that password during the initial cluster installation — that's how the RP passes the username (token name) and password to the cluster during the bootstrap process.
The cluster's "source of truth" for its current ACR password is the cluster
secret in openshift-azure-operator
. That's what it uses during the pullsecret controller's reconciliation to ultimately construct the pull secret that nodes use to get container images. The cluster never accesses the password on the cluster document after installation, so I opted not to as well in order to retain a single source of truth for the password.
As far as testing the validity of the new credentials, it would make sense to me to use the exact same credentials the cluster uses in order to minimize potential differences in data. That means getting the secret from the cluster instead of the RP, and deserializing the token stored on it.
pkg/util/acrtoken/acrtoken.go
Outdated
return err | ||
} | ||
|
||
client, err := m.newAzAcrClient(m.env, fmt.Sprintf("https://%s", m.env.ACRDomain()), token, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The container registry's URL is stored as part of the .dockerconfigjson string. You should be able to use that to get a connection to the ACR without having to use the environment interface at all.
Please rebase pull request. |
341d000
to
4151458
Compare
4151458
to
023cece
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is still in draft, but I thought I'd leave some thoughts on the logic. Thanks for your efforts so far!
023cece
to
0e29cc2
Compare
642f582
to
81064e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ValidatePullSecret logic looks good to me! I think the only thing to iron out is when exactly to run it. Right now, were running it on PUCM as well as in the operator, which I think I agree with because we'd get both an operator status and a PUCM failure, but I can see why we'd only really need one or the other since we're always doing the rotation.
d5583ee
to
48c82ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few small comments, but for the most part, looking good!
@@ -130,7 +130,7 @@ func TestAdminUpdateSteps(t *testing.T) { | |||
"[Action fixMCSCert-fm]", | |||
"[Action fixMCSUserData-fm]", | |||
"[Action ensureGatewayUpgrade-fm]", | |||
"[Action rotateACRTokenPassword-fm]", | |||
"[Condition rotateAndValidateACRTokenPassword-fm, timeout 5m0s]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we aren't waiting on a reconciliation here, do you think we should just make this an Action
with no timeout instead of leaving it as Condition
?
pkg/util/pullsecret/pullsecret.go
Outdated
if err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant? It looks like we already returned any possible error on line 198.
pkg/util/pullsecret/pullsecret.go
Outdated
err = fmt.Errorf("credentials format error: %s", registry) | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit - might as well consolidate these lines since you did it on line 205:
err = fmt.Errorf("credentials format error: %s", registry) | |
return err | |
return fmt.Errorf("credentials format error: %s", registry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions about which tokens we're validating in the pull secret.
if err != nil { | ||
return err | ||
} | ||
for registry, authBase64 := range dockerConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will check every registry in openshift-config/pull-secret, including user-defined credentials. Are we okay with doing that? And if we are, errors returned from this loop will short-circuit the validation. For example, if there is a pull-secret with both a user-defined registry and the ARO ACR defined and the user secret is checked first and has some sort of error, this will early return on that error and the ARO ACR token will never be validated. I don't know how we feel about using goroutines to check credentials concurrently, but since one credential's validity shouldn't have a bearing on another cred that may be a solution. Alternatively, we could extract the ARO ACR token and only check that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm +1 to extracting the ARO ACR token for validation and disregarding the other tokens.
pkg/util/pullsecret/pullsecret.go
Outdated
err = fmt.Errorf("credentials format error: %s", registry) | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
wantAuth map[string]string | ||
wantErr string | ||
client RegistryClient | ||
}{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to validate all tokens in the secret then we'll need a test case for a valid ACR token being checked after an invalid user token. I don't remember if go slices are guaranteed to be handled in order in a for
loop but it would be good at least attempt it.
@@ -102,7 +102,7 @@ func (m *manager) adminUpdate() []steps.Step { | |||
if isEverything { | |||
toRun = append(toRun, | |||
steps.Action(m.ensureGatewayUpgrade), | |||
steps.Action(m.rotateACRTokenPassword), | |||
steps.Condition(m.rotateAndValidateACRTokenPassword, 5*time.Minute, true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need to ensure the operator is running before moving forward with the rotation? You can make sure the operator is ready using the aroDeploymentReady
Condition function.
@@ -195,7 +195,7 @@ func (m *manager) Update(ctx context.Context) error { | |||
steps.Action(m.createOrUpdateDenyAssignment), | |||
steps.Action(m.startVMs), | |||
steps.Condition(m.apiServersReady, 30*time.Minute, true), | |||
steps.Action(m.rotateACRTokenPassword), | |||
steps.Condition(m.rotateAndValidateACRTokenPassword, 5*time.Minute, true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question/comment as in adminUpdate
.
if err != nil { | ||
return err | ||
} | ||
for registry, authBase64 := range dockerConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm +1 to extracting the ARO ACR token for validation and disregarding the other tokens.
Please rebase pull request. |
cb8ec65
to
55611b5
Compare
55611b5
to
3ac820c
Compare
Please rebase pull request. |
What's the current state of this and can I help get it moving again? |
Which issue this PR addresses:
Fixes ARO-3018
What this PR does / why we need it:
When we rotate ACR tokens via PUCM, it is possible for the token to end up in an ambiguous state, possibly due to a race condition. See slack thread for more context: https://redhat-internal.slack.com/archives/C02ULBRS68M/p1689709898669219
Test plan for issue:
TODO
Is there any documentation that needs to be updated for this PR?
TODO