-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Remove cache when catalog is deleted #1207
Conversation
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
d768396
to
a953950
Compare
a953950
to
40e003b
Compare
I like the direction of this. Maybe not something to do in this PR, but once we have a separate catalog controller, I think it would make sense to have that controller be the thing that also populates Right now, the operator-controller catalog cache is populated on-demand when a ClusterExtension is reconciled. The outcome is that some ClusterExtension reconciles take quite a bit longer because they have to pause to pull and extract catalog content from catalogd. If we moved pull/extract logic to a separate |
40e003b
to
5ca78e0
Compare
0ed5135
to
b607a08
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1207 +/- ##
==========================================
+ Coverage 76.49% 76.60% +0.10%
==========================================
Files 39 40 +1
Lines 2361 2406 +45
==========================================
+ Hits 1806 1843 +37
- Misses 389 395 +6
- Partials 166 168 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
ff58fe9
to
f664d7d
Compare
f664d7d
to
7a2b32f
Compare
cmd/manager/main.go
Outdated
|
||
if err = (&controllers.ClusterCatalogReconciler{ | ||
Client: cl, | ||
Finalizers: clusterCatalogFinalizers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry a little bit about using finalizers here. We'll end up with two different controllers that always race when catalogs are created and destroyed, which I'm guessing will add unnecessary reconcile churn.
WDYT about a best-effort garbage collector that runs the cleanup logic along the lines of:
if err := r.Get(req); errors.IsNotFound(err) {
if err := cache.Cleanup(req); err != nil {
metrics.IncrementCatalogCacheCleanupFailure()
eventrecorder.Event("failed to clear catalog cache for %q", req)
}
}
Another option is a garbage collector somewhat like catalogd has here: https://github.com/operator-framework/catalogd/blob/main/internal/garbagecollection/garbage_collector.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry a little bit about using finalizers here. We'll end up with two different controllers that always race when catalogs are created and destroyed, which I'm guessing will add unnecessary reconcile churn.
@joelanford since we only add/remove the finaliser on the object in this controller - the worse thing that can happen here is a conflict. In case of a conflict one of the controllers will re-queue.
Yes, there will be a bit of overhead even without conflicts: one of the controllers will have to do a no-op reconcile at some point. E.g.:
- User creates
ClusterCatalog
- operator-controller reconciles it - add the finaliser
- catalogd reconciles it - adds finaliser, unpacks, etc
- operator-controller reconciles it again - we just exit because the finaliser is already in place (we are already at the desired state)
It can be the other way around around (catalogd reconciles first, operator-controller second). Similar on deletion.
WDYT about a best-effort garbage collector that runs the cleanup logic along the lines of
I think without the finaliser our controller will not be able to reconcile on the level: only on the edge. With that there is a chance that the controller will not see some events. These cases are probably going to be rare: I imagine that there should be some issue (e.g. networking) lasting long enough for etcd to clean up historical versions.
Another option is a garbage collector somewhat like catalogd has here: https://github.com/operator-framework/catalogd/blob/main/internal/garbagecollection/garbage_collector.go
I think there might be time-of-check to time-of-use issue. E.g new catalog is created after cleanup routine lists catalogs, but before it listed FS.
We can probably protect it with a mutex we already have to block reads. But this will mean that we will have to protect listing from the etcd in addition to FS IO and it will increase the blocking time for readers.
I think these two options look more brittle to me and can lead to issues which are going to be (likely) rare, but hard to debug.
I think my preference is to have a bit of overhead in reconciliation, but be explicit with finalisers. But I'm happy to pivot in a different direction.
Let me know what you want me to do here given the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we still go with a "controller" approach that just doesn't modify the ClusterCatalog
resource?
If the pod crashes, is our cache still populated? If so, we could do an initial list + cleanup on startup and then rely on informers to react to events that trigger our cleanup logic without modification of ClusterCatalog
resources. Once the informer is up I am pretty confident it won't miss any events. Any events we miss are likely because the operator-controller manager container crashed and is going to restart and do the initial list+cleanup.
Does this sound like it could be a good middle ground?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@everettraven I think this is the same as the best-effort option Joe suggested.
Anyway - I updated the controller to not change ClusterCatalog
. Please take another look.
If the pod crashes, is our cache still populated? If so, we could do an initial list + cleanup on startup and then rely on informers to react to events that trigger our cleanup logic without modification of
ClusterCatalog
resources. Once the informer is up I am pretty confident it won't miss any events. Any events we miss are likely because the operator-controller manager container crashed and is going to restart and do the initial list+cleanup.
On pod crash we will end up with filesystem containing cache since we are currently using emptyDir
which persists between crashes. But we will be rewriting the cache for each catalog since this map is not populated on restart. So it looks like using emptyDir
is pointless here.
But my point was not about crashes: I was talking about issues when the manager is still running, but can't connect to API server. In this case it will still think that the cache is still present and valid because it hasn't seen delete event.
Perhaps getting rid of emptyDir
will help? Edit: Ah, we need emptyDir
for unpack cache because it is good to have it persist between pod restarts.
f668757
to
c59e1f4
Compare
Enables us to deletes cache directory for a given catalog from the filesystem. Signed-off-by: Mikalai Radchuk <[email protected]>
c59e1f4
to
9ea5da0
Compare
9ea5da0
to
b6ae265
Compare
if client.IgnoreNotFound(err) != nil { | ||
return ctrl.Result{}, err | ||
} | ||
if apierrors.IsNotFound(err) { | ||
return ctrl.Result{}, r.Cache.Remove(req.Name) | ||
} | ||
|
||
return ctrl.Result{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it took me an extra second to parse the IgnoreNotFound
+ IsNotFound
checks. We could simplify to the following.
if client.IgnoreNotFound(err) != nil { | |
return ctrl.Result{}, err | |
} | |
if apierrors.IsNotFound(err) { | |
return ctrl.Result{}, r.Cache.Remove(req.Name) | |
} | |
return ctrl.Result{}, nil | |
if apierrors.IsNotFound(err) { | |
return ctrl.Result{}, r.Cache.Remove(req.Name) | |
} | |
return ctrl.Result{}, err |
If you aren't a fan of return ctrl.Result{}, err
at the end serving the dual purpose of returning nil or non-nil errors, we could also refactor like this:
if client.IgnoreNotFound(err) != nil { | |
return ctrl.Result{}, err | |
} | |
if apierrors.IsNotFound(err) { | |
return ctrl.Result{}, r.Cache.Remove(req.Name) | |
} | |
return ctrl.Result{}, nil | |
if apierrors.IsNotFound(err) { | |
return ctrl.Result{}, r.Cache.Remove(req.Name) | |
} | |
if err != nil { | |
return ctrl.Result{}, err | |
} | |
return ctrl.Result{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the Get()
call? In our SetupWithManager()
method below we filter events to only care about the delete events, so reconcile should only ever be called when a delete event is triggered for a ClusterCatalog
resource.
This would make the Reconcile
method essentially:
func (r *ClusterCatalogReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
return ctrl.Result{}, r.Cache.Remove(req.Name)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, true. We could make it that simple for now.
I'm guessing we will follow-up at some point to do what I mentioned here: #1207 (comment)
But the code is so simple as is that it would be easy to refactor later without accidentally introducing bugs or causing a regression of this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already have draft PR where we will be populating cache on non-delete reconciles.
So I think I'll keep the get call & relevant RBAC here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with the second suggestion for error handling for the same reason: will be adding more into this controller soon.
return ctrl.Result{}, err | ||
} | ||
if apierrors.IsNotFound(err) { | ||
return ctrl.Result{}, r.Cache.Remove(req.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we create one or more new metrics related to this cache? For example:
- gauge: number of known ClusterCatalogs
- gauge: number of local caches
Or maybe just store the diff as a single gauge (that we would expect to always be 0)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is unreasonable, but I would probably consider this outside the scope of this specific PR. I do think we should eventually have a discussion about instrumenting OLMv1 with meaningful metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created #1287 to keep track of this. Are there any labels we'd want to apply to the issue?
New finaliser allows us to remove catalog cache from filesystem on catalog deletion. Signed-off-by: Mikalai Radchuk <[email protected]>
b6ae265
to
6085ee8
Compare
Description
We add a finaliser to
ClusterCatalog
which enables us to remove catalog cache from filesystem on catalog deletion.Fixes #948
Reviewer Checklist