This repository has been archived by the owner on Feb 27, 2023. It is now read-only.

Discoverer should not retry deleting a service/endpoint that does not exist #134

Open
alexbrand opened this issue Jun 1, 2018 · 12 comments
Labels
discoverer kind/feature Categorizes issue or PR as related to a new feature.

Comments

@alexbrand
Contributor

While doing performance testing, I noticed that the discovery components are wasting cycles trying to delete services/endpoints that no longer exist. This is currently impacting other tests (which can be thought of as other gimbal teams) because the discoverer is busy trying to delete non-existent services, instead of discovering new services for other teams.

How does this happen? The performance test creates a namespace in the backend cluster and populates it with hundreds of services and endpoints. These are then discovered by Gimbal and everything is fine.

Once the test is done, instead of deleting services/endpoints one by one, the performance test script destroys the test namespace on both the gimbal cluster and the backend cluster. The deletion of the namespace in the backend cluster results in watch notifications sent to the discoverers about the deletion of services. The problem is that the services/endpoints have already been deleted in the gimbal cluster, because the entire namespace was deleted.

@alexbrand alexbrand added kind/feature Categorizes issue or PR as related to a new feature. discoverer labels Jun 1, 2018
@stevesloka
Member

How many threads are set up on your discoverer? Could you retest with a higher number? Just curious whether that might have a positive effect on the performance of the queue.

@alexbrand
Contributor Author

Originally had 2, bumped it up to 16 but still seeing this issue. Will try bumping up even higher, but not sure how much it'll help.

@stevesloka
Member

Having more workers should process the queue faster since they operate in parallel. Just a thought to try before going down a design route to resolve this.
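
For reference, a minimal sketch of how worker threadiness maps to queue throughput, assuming the discoverer follows the usual client-go pattern of several goroutines draining one rate-limited workqueue (`runWorkers`, `processNextItem`, and `syncHandler` are illustrative names, not the actual Gimbal code):

```go
package discoverer

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

// runWorkers starts `threadiness` goroutines that all drain the same queue,
// so a higher number lets more keys be processed concurrently.
func runWorkers(queue workqueue.RateLimitingInterface, threadiness int, stopCh <-chan struct{}) {
	for i := 0; i < threadiness; i++ {
		go wait.Until(func() {
			for processNextItem(queue) {
			}
		}, time.Second, stopCh)
	}
}

func processNextItem(queue workqueue.RateLimitingInterface) bool {
	key, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(key)

	if err := syncHandler(key.(string)); err != nil {
		// Re-queuing on every error is what keeps the delete storm alive:
		// the same already-deleted services come back again and again.
		queue.AddRateLimited(key)
		return true
	}
	queue.Forget(key)
	return true
}

// syncHandler stands in for the discoverer's actual reconcile logic.
func syncHandler(key string) error { return nil }
```

More workers add concurrency, but they don't stop a failing key from being re-added, which is why this only drains the backlog faster rather than eliminating it.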

@alexbrand
Contributor Author

I haven't thought through the implications, but I think we should be able to swallow the "does not exist" error when we are trying to delete, instead of raising it up to the queue for re-queuing.

What's unclear to me is if we can naturally get into this state by somehow processing a delete notification before processing the corresponding add, and thus end up with an extra service that should have been deleted.
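
As a rough sketch of the "swallow not found" idea, assuming the discoverer deletes through a standard client-go clientset (the exact `Delete` signature varies by client-go version, and `deleteService` is an illustrative name):

```go
package discoverer

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteService treats "already gone" as success instead of returning an
// error that would put the key back on the queue.
func deleteService(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	err := client.CoreV1().Services(namespace).Delete(ctx, name, metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		// The service no longer exists; there is nothing left to do.
		return nil
	}
	return err
}
```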

@alexbrand
Contributor Author

These charts are interesting.

We can see when the namespace is deleted and all services are gone, but the discoverer is still trying to delete them, and thus accumulating errors.

Looks like it took approximately 20 minutes to drain the queue.

[chart: discoverer queue errors accumulating after the namespace deletion, draining over roughly 20 minutes]

@alexbrand
Contributor Author

Memory usage of discoverer:

[chart: discoverer memory usage over the same period]

@stevesloka
Member

Ohh, I understand now; I didn't realize that we were re-queuing each time on error. I think we could handle the "does not exist" error, or put a max retry on the queue: once an item has been attempted twice, just drop it from the queue. That would also cover the scenario where an update or add errors (for whatever reason) and we keep retrying.
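
A minimal sketch of the max-retry idea, assuming a client-go rate-limited workqueue whose `NumRequeues` counter tracks how often a key has been retried (`maxRetries` and `handleErr` are illustrative names):

```go
package discoverer

import "k8s.io/client-go/util/workqueue"

const maxRetries = 2

// handleErr decides whether a failed key goes back on the queue or is dropped.
func handleErr(queue workqueue.RateLimitingInterface, err error, key interface{}) {
	if err == nil {
		queue.Forget(key)
		return
	}
	if queue.NumRequeues(key) < maxRetries {
		// Retry with backoff, up to the cap.
		queue.AddRateLimited(key)
		return
	}
	// Give up so one bad object cannot keep the queue busy indefinitely.
	queue.Forget(key)
}
```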

@alexbrand
Contributor Author

Yeah, I actually think we need both... It doesn't look like we have a max retry on the queue item. I'll pick this up and look into it more.

@stevesloka
Member

Not sure if we'd want a retry metric to track how many items are retried, or thrown out because of errors.

@alexbrand
Contributor Author

alexbrand commented Jun 7, 2018

Now that we know we were not actually retrying before #141, the issue title was somewhat misleading.

What was actually going on is that the namespace was deleted out from under the discoverer on both clusters, resulting in the backend cluster's API server sending a long list of DELETE watch notifications to the discoverer. This is the expected behavior.

Now that we have added retries though, we should probably not attempt to retry a deletion for a service or endpoint that does not exist.

Another idea (which needs a bit more thought): Have a separate queue per namespace, so that the operations performed by one team do not affect other teams. This would be a larger change that requires discussion and design though (cc @rosskukulinski @stevesloka).
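
A rough sketch of what per-namespace queues could look like, assuming one rate-limited workqueue (with its own worker or small worker pool) per backend namespace; all names here are illustrative and this is just to make the idea concrete:

```go
package discoverer

import (
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// namespacedQueues holds one queue per backend namespace so a flood of
// events in one namespace does not starve the others.
type namespacedQueues struct {
	mu     sync.Mutex
	queues map[string]workqueue.RateLimitingInterface
}

func (n *namespacedQueues) queueFor(namespace string) workqueue.RateLimitingInterface {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.queues == nil {
		n.queues = map[string]workqueue.RateLimitingInterface{}
	}
	q, ok := n.queues[namespace]
	if !ok {
		q = workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
		n.queues[namespace] = q
		// A dedicated worker (or small pool) would be started for this
		// queue here, independent of the workers for other namespaces.
	}
	return q
}
```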

@alexbrand
Contributor Author

The more I think about this, the more I am second guessing whether we should do it. It seems like the deletion of a namespace with thousands of services is an unlikely scenario when running Gimbal in production. Is that a reasonable statement? @stevesloka @rosskukulinski

@stevesloka
Member

My guess is that thousands of services within a single namespace is unlikely. I would think, though, that bringing up a new cluster could see a high volume of data, as could hitting the re-sync window on a large cluster.

This was my reason for the re-queue parameter; you may not want to retry more than once if something fails.

@alexbrand alexbrand removed their assignment Jun 12, 2018