Discoverer should not retry deleting a service/endpoint that does not exist #134

alexbrand · 2018-06-01T19:28:27Z

While doing performance testing, I noticed that the discovery components are wasting cycles trying to delete services/endpoints that no longer exist. This is currently impacting other tests (which can be thought of as other gimbal teams) because the discoverer is busy trying to delete non-existent services, instead of discovering new services for other teams.

How does this happen? The performance test creates a namespace where it creates hundreds of services and endpoints in the backend cluster. These are then discovered by Gimbal and everything is fine.

Once the test is done, instead of deleting services/endpoints one by one, the performance test script destroys the test namespace on both the gimbal cluster and the backend cluster. The deletion of the namespace in the backend cluster results in watch notifications sent to the discoverers about the deletion of services. The problem is that the services/endpoints have already been deleted in the gimbal cluster, because the entire namespace was deleted.

stevesloka · 2018-06-01T19:54:18Z

How many threads are set up on your discoverer? Could you retest with a higher number? Just curious how that might have (a positive) effect on the performance of the queue.

alexbrand · 2018-06-01T21:02:31Z

Originally had 2, bumped it up to 16 but still seeing this issue. Will try bumping up even higher, but not sure how much it'll help.

stevesloka · 2018-06-01T21:11:24Z

Having more should process the queue faster since they should operate in parallel. But just a thought before going down a design route to see how to resolve.

alexbrand · 2018-06-01T21:17:44Z

I haven't thought through the implications, but I think we should be able to swallow the "does not exist" error when we are trying to delete, instead of raising it up to the queue for re-queuing.

What's unclear to me is if we can naturally get into this state by somehow processing a delete notification before processing the corresponding add, and thus end up with an extra service that should have been deleted.

alexbrand · 2018-06-01T21:20:57Z

These charts are interesting.

We can see when the namespace is deleted and all services are gone, but the discoverer is still trying to delete them, and thus accumulating errors.

Looks like it took approximately 20 minutes to drain the queue.

alexbrand · 2018-06-01T21:23:06Z

Memory usage of discoverer:

stevesloka · 2018-06-02T00:33:16Z

Ohh I understand now, I didn't realize that we were re-queuing each time on error. I think we could handle the "does not exist" or put a max retry on the queue, once an item is attempted 2 times, then just drop from the queue. That would solve the same scenario where maybe an update or add errors (for whatever reason) and we keep trying.

alexbrand · 2018-06-04T13:40:13Z

Yeah, I actually think we need both... It doesn't look like we have a max retry on the queue item. I'll pick this up and look into it more.

stevesloka · 2018-06-04T13:54:12Z

Not sure if we'd want a retry metric or not t track how many items are re-tried, or thrown out because if error.

alexbrand · 2018-06-07T18:54:02Z

Now that we know we were not actually retrying before #141, the issue title was somewhat misleading.

What was actually going on is that the namespace is deleted from under the discoverer on both clusters, resulting in the API server from the backend cluster sending a long list of DELETE watch notifications to the discoverer. This is the expected behavior.

Now that we have added retries though, we should probably not attempt to retry a deletion for a service or endpoint that does not exist.

Another idea (which needs a bit more thought): Have a separate queue per namespace, so that the operations performed by one team do not affect other teams. This would be a larger change that requires discussion and design though (cc @rosskukulinski @stevesloka).

alexbrand · 2018-06-07T19:26:40Z

The more I think about this, the more I am second guessing whether we should do it. It seems like the deletion of a namespace with thousands of services is an unlikely scenario when running Gimbal in production. Is that a reasonable statement? @stevesloka @rosskukulinski

stevesloka · 2018-06-07T19:29:39Z

My guess that thousands of services within a namespace is unlikely. I would think though, that bringing up a new cluster could see a high input of data, also, anytime the re-sync window is hit on a large cluster.

This was my reason for the re-queue param, you may not want to retry more than once if something fails.

alexbrand added kind/feature Categorizes issue or PR as related to a new feature. discoverer labels Jun 1, 2018

alexbrand self-assigned this Jun 4, 2018

alexbrand mentioned this issue Jun 5, 2018

discovery: fix syncqueue retries #141

Merged

alexbrand removed their assignment Jun 12, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discoverer should not retry deleting a service/endpoint that does not exist #134

Discoverer should not retry deleting a service/endpoint that does not exist #134

alexbrand commented Jun 1, 2018

stevesloka commented Jun 1, 2018

alexbrand commented Jun 1, 2018

stevesloka commented Jun 1, 2018

alexbrand commented Jun 1, 2018

alexbrand commented Jun 1, 2018

alexbrand commented Jun 1, 2018

stevesloka commented Jun 2, 2018

alexbrand commented Jun 4, 2018

stevesloka commented Jun 4, 2018

alexbrand commented Jun 7, 2018 •

edited

Loading

alexbrand commented Jun 7, 2018

stevesloka commented Jun 7, 2018

Discoverer should not retry deleting a service/endpoint that does not exist #134

Discoverer should not retry deleting a service/endpoint that does not exist #134

Comments

alexbrand commented Jun 1, 2018

stevesloka commented Jun 1, 2018

alexbrand commented Jun 1, 2018

stevesloka commented Jun 1, 2018

alexbrand commented Jun 1, 2018

alexbrand commented Jun 1, 2018

alexbrand commented Jun 1, 2018

stevesloka commented Jun 2, 2018

alexbrand commented Jun 4, 2018

stevesloka commented Jun 4, 2018

alexbrand commented Jun 7, 2018 • edited Loading

alexbrand commented Jun 7, 2018

stevesloka commented Jun 7, 2018

alexbrand commented Jun 7, 2018 •

edited

Loading