Implement space deletion job endpoint #2647

davewalter · 2023-06-29T00:40:49Z

Is there a related GitHub Issue?

#2604

What is this change about?

Return the status of space deletion based on the underlying CFSpace resource state.

Does this PR introduce a breaking change?

No.

Acceptance Steps

See issue.

Tag your pair, your PM, and/or team

Paired with @acosta11

api/handlers/job.go

kieron-dev · 2023-06-30T09:37:39Z

I'm concerned about the new dynamic client here. If you're concerned about caching, I believe this is just another non-caching client, as the privileged and user clients here are already not using a cache.

gcapizzi · 2023-06-30T09:56:30Z

We are pretty sure that both clients are actually uncached. We tried to put GetOrg back in and see if we could make cf delete-org fail by running in a loop, but without success.

Maybe the condition you caught is not due to caching but just general etcd eventual consistency? In that case maybe we should do retries?

I would avoid introducing a new client unless we can prove that it makes a difference.

davewalter · 2023-06-30T23:14:50Z

Thank you both for taking a look at this. 🙏

I believe this is just another non-caching client, as the privileged and user clients here are already not using a cache.

That was our understanding as well.

Maybe the condition you caught is not due to caching but just general etcd eventual consistency? In that case maybe we should do retries?

The thing that led us to believe that the client used by the API was using a cache was that @matt-royal consistently encountered an issue when testing on a KinD cluster on his M1 Mac where the first GET call to /v3/jobs that the CF CLI makes immediately after the delete request was returning a 404 error which only happens if the deletion timestamp of the specified org is empty. I was not able to reproduce this on my i9 Mac, but @clintyoshimura and I were able to reproduce this on an AKS cluster. I just deployed from the main branch onto a KinD cluster and was finally able to reproduce it locally:

$ cf delete-org o -v; kubectl get -n cf cforg/cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76 -o yaml
...
Deleting org o as cf-admin...
REQUEST: [2023-06-30T13:58:24-07:00]
GET /v3/organizations?names=o HTTP/1.1
Host: localhost
Accept: application/json
Authorization: [PRIVATE DATA HIDDEN]
Content-Type: application/json
User-Agent: cf/8.7.1+9c81242.2023-06-15 (go1.20.5; amd64 darwin)
[application/json Content Hidden]

RESPONSE: [2023-06-30T13:58:24-07:00]
HTTP/1.1 200 OK
Content-Type: application/json
Date: Fri, 30 Jun 2023 20:58:24 GMT
Server: envoy
Vary: Accept-Encoding
X-Correlation-Id: 81578357-659d-44ff-a2db-aedf9a12948f
X-Envoy-Upstream-Service-Time: 9
{
  "pagination": {
    "first": {
      "href": "https://localhost/v3/organizations?names=o"
    },
    "last": {
      "href": "https://localhost/v3/organizations?names=o"
    },
    "next": null,
    "previous": null,
    "total_pages": 1,
    "total_results": 1
  },
  "resources": [
    {
      "created_at": "2023-06-30T20:57:36Z",
      "guid": "cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76",
      "links": {
        "self": {
          "href": "https://localhost/v3/organizations/cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76"
        }
      },
      "metadata": {
        "annotations": {
          "korifi.cloudfoundry.org/creation-version": "v0.7.2-271-ga6fff152"
        },
        "labels": {}
      },
      "name": "o",
      "relationships": {},
      "suspended": false,
      "updated_at": "2023-06-30T20:57:36Z"
    }
  ]
}


REQUEST: [2023-06-30T13:58:24-07:00]
DELETE /v3/organizations/cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76 HTTP/1.1
Host: localhost
Accept: application/json
Authorization: [PRIVATE DATA HIDDEN]
Content-Type: application/json
User-Agent: cf/8.7.1+9c81242.2023-06-15 (go1.20.5; amd64 darwin)
[application/json Content Hidden]

RESPONSE: [2023-06-30T13:58:24-07:00]
HTTP/1.1 202 Accepted
Content-Length: 0
Date: Fri, 30 Jun 2023 20:58:24 GMT
Location: https://localhost/v3/jobs/org.delete~cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76
Server: envoy
X-Correlation-Id: 0b708888-d59e-402e-a072-07511a855f52
X-Envoy-Upstream-Service-Time: 11

REQUEST: [2023-06-30T13:58:24-07:00]
GET /v3/jobs/org.delete~cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76 HTTP/1.1
Host: localhost
Accept: application/json
Authorization: [PRIVATE DATA HIDDEN]
Content-Type: application/json
User-Agent: cf/8.7.1+9c81242.2023-06-15 (go1.20.5; amd64 darwin)
[application/json Content Hidden]

RESPONSE: [2023-06-30T13:58:24-07:00]
HTTP/1.1 404 Not Found
Content-Type: application/json
Date: Fri, 30 Jun 2023 20:58:24 GMT
Server: envoy
Vary: Accept-Encoding
X-Correlation-Id: 6ae99bec-1e6a-47c1-975f-8c86eb0cd0b9
X-Envoy-Upstream-Service-Time: 9
{
  "errors": [
    {
      "code": 10010,
      "detail": "Job not found. Ensure it exists and you have access to it.",
      "title": "CF-ResourceNotFound"
    }
  ]
}


Job not found. Ensure it exists and you have access to it.
FAILED

apiVersion: korifi.cloudfoundry.org/v1alpha1
kind: CFOrg
metadata:
  annotations:
    korifi.cloudfoundry.org/creation-version: v0.7.2-271-ga6fff152
  creationTimestamp: "2023-06-30T20:57:36Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2023-06-30T20:58:24Z"
  finalizers:
  - cfOrg.korifi.cloudfoundry.org
  generation: 2
  name: cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76
  namespace: cf
  resourceVersion: "2297"
  uid: ff51f747-dc2c-4760-bbfb-902e700c4c88
spec:
  displayName: o
status:
  conditions:
  - lastTransitionTime: "2023-06-30T20:57:36Z"
    message: ""
    observedGeneration: 1
    reason: Ready
    status: "True"
    type: Ready
  guid: cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76
  observedGeneration: 2

I tried again with a new org with a space and an app pushed to check the API logs. When I deleted the org, I received the same output as above, and saw this in the logs:

{"severity":"INFO","timestamp":"2023-06-30T21:06:05.457200276Z","logger":"handlers.job.get.handleOrgDelete","caller":"errors/errors.go:33","message":"org not marked for deletion","correlation-id":"c48fbf91-69c5-48f0-a4b6-4c94312c8947","OrgGUID":"cf-org-9579b6a8-d39d-427e-a90e-0d3ac1
0f0095","reason":"job \"org.delete~cf-org-9579b6a8-d39d-427e-a90e-0d3ac10f0095\" not found"}

This could be an etcd eventual consistency problem, but it seemed to be happening too consistently for that.

The other issue I ran into while testing the org deletion job was that the API would return a "COMPLETED" status in the response to (normally) the third call, which would make the CLI stop polling and print OK. However, when we retrieved a list of CFOrg resources with kubectl or watched it in another window with k9s, we could still see the org that had been deleted for about a minute afterward. Switching to the dynamic user client resulted in the /v3/jobs endpoint correctly returning a "PROCESSING" status until after the CFOrg resource disappeared from k9s. The strange was that space deletion did not act the same way. Instead, it would return a "PROCESSING" status until the space disappeared from k9s and then return the "COMPLETED" status.

I would avoid introducing a new client unless we can prove that it makes a difference.

I agree that adding the new client feels unnecessary. The strange thing is that the new dynamic client seems to eliminate both of the behaviours I described above when testing on the AKS cluster. I just tried an alternative approach, which is to update the org's delete handler to wait for up to 5 seconds until it can confirm that the deletion timestamp is set. This appears to solve the problem of the first call to /v3/jobs returning a 404 error but does not solve the problem of the third call returning a "COMPLETED" status even though the org is still visible in k9s.

gcapizzi · 2023-07-03T10:28:27Z

This appears to solve the problem of the first call to /v3/jobs returning a 404 error but does not solve the problem of the third call returning a "COMPLETED" status even though the org is still visible in k9s.

I wouldn't worry about resources still visible in k9s: there's no cache issue that can make things invisible even if they still exist, only make things visible even if they don't exist anymore, so it's probably due to k9s's cache.

If we know for a fact that both clients are uncached, then we shouldn't be calling only one of the two "uncached". We should understand why the dynamic client succeeds where the regular client fails, but it might just be because it's slower. If this is the case, then I'd just add some retries and get rid of the dynamic client altogether.

davewalter · 2023-07-03T21:59:27Z

Julian and I figured out what was causing the /v3/jobs endpoint to return "COMPLETED" early; it corresponded to the deletion of the role binding in the namespace. We ended up creating a new GetOrgForDeletion function in the repository to allow the job handler to wait for the org to actually be deleted.

We also decided to remove the new user client and go with the retry loop to wait for the deletion timestamp to be set in the org/space.

[#2605] Co-authored-by: Dave Walter <[email protected]>

- Refactor org deletion helper to handle all types. [#2604] Co-authored-by: Dave Walter <[email protected]>

api/handlers/job.go

api/repositories/org_repository.go

matt-royal · 2023-07-05T21:09:30Z

I just confirmed that both cf delete-space and cf delete-org work as expected on this PR branch

- Add a retry loop to wait for the deletion timestamp to propagate. - Added a GetOrgUnfiltered function to the org repository to avoid falsely returning a not found error when the role binding is deleted from an org during deletion. - Consolidate CFSpaceRepository and SpaceRepository interfaces. [#2604] [#2605] Co-authored-by: Dave Walter <[email protected]> Co-authored-by: Julian Hjortshoj <[email protected]>

tcdowney reviewed Jun 29, 2023

View reviewed changes

api/handlers/job.go Outdated Show resolved Hide resolved

davewalter force-pushed the issues/2604 branch 2 times, most recently from c308995 to 2340c78 Compare June 30, 2023 01:57

davewalter force-pushed the issues/2604 branch from 2340c78 to 0bf6f4b Compare July 3, 2023 21:53

acosta11 and others added 2 commits July 5, 2023 12:10

Org repository sets deletion time

eb3507f

[#2605] Co-authored-by: Dave Walter <[email protected]>

Implement space deletion job

214e34d

- Refactor org deletion helper to handle all types. [#2604] Co-authored-by: Dave Walter <[email protected]>

davewalter force-pushed the issues/2604 branch from 0bf6f4b to 4d12daa Compare July 5, 2023 19:11

matt-royal requested changes Jul 5, 2023

View reviewed changes

api/handlers/job.go Outdated Show resolved Hide resolved

api/handlers/job.go Outdated Show resolved Hide resolved

api/repositories/org_repository.go Outdated Show resolved Hide resolved

davewalter force-pushed the issues/2604 branch from 4d12daa to 9677715 Compare July 6, 2023 01:20

davewalter requested a review from matt-royal July 6, 2023 01:33

matt-royal approved these changes Jul 6, 2023

View reviewed changes

davewalter merged commit b473e57 into main Jul 6, 2023
6 checks passed

davewalter deleted the issues/2604 branch July 6, 2023 23:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement space deletion job endpoint #2647

Implement space deletion job endpoint #2647

davewalter commented Jun 29, 2023

kieron-dev commented Jun 30, 2023

gcapizzi commented Jun 30, 2023 •

edited

Loading

davewalter commented Jun 30, 2023

gcapizzi commented Jul 3, 2023

davewalter commented Jul 3, 2023

matt-royal commented Jul 5, 2023

Implement space deletion job endpoint #2647

Implement space deletion job endpoint #2647

Conversation

davewalter commented Jun 29, 2023

Is there a related GitHub Issue?

What is this change about?

Does this PR introduce a breaking change?

Acceptance Steps

Tag your pair, your PM, and/or team

kieron-dev commented Jun 30, 2023

gcapizzi commented Jun 30, 2023 • edited Loading

davewalter commented Jun 30, 2023

gcapizzi commented Jul 3, 2023

davewalter commented Jul 3, 2023

matt-royal commented Jul 5, 2023

gcapizzi commented Jun 30, 2023 •

edited

Loading