Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement space deletion job endpoint #2647

Merged
merged 3 commits into from
Jul 6, 2023
Merged

Implement space deletion job endpoint #2647

merged 3 commits into from
Jul 6, 2023

Conversation

davewalter
Copy link
Member

Is there a related GitHub Issue?

#2604

What is this change about?

Return the status of space deletion based on the underlying CFSpace resource state.

Does this PR introduce a breaking change?

No.

Acceptance Steps

See issue.

Tag your pair, your PM, and/or team

Paired with @acosta11

api/handlers/job.go Outdated Show resolved Hide resolved
@davewalter davewalter force-pushed the issues/2604 branch 2 times, most recently from c308995 to 2340c78 Compare June 30, 2023 01:57
@kieron-dev
Copy link
Contributor

I'm concerned about the new dynamic client here. If you're concerned about caching, I believe this is just another non-caching client, as the privileged and user clients here are already not using a cache.

@gcapizzi
Copy link
Contributor

gcapizzi commented Jun 30, 2023

We are pretty sure that both clients are actually uncached. We tried to put GetOrg back in and see if we could make cf delete-org fail by running in a loop, but without success.

Maybe the condition you caught is not due to caching but just general etcd eventual consistency? In that case maybe we should do retries?

I would avoid introducing a new client unless we can prove that it makes a difference.

@davewalter
Copy link
Member Author

Thank you both for taking a look at this. 🙏

I believe this is just another non-caching client, as the privileged and user clients here are already not using a cache.

That was our understanding as well.

Maybe the condition you caught is not due to caching but just general etcd eventual consistency? In that case maybe we should do retries?

The thing that led us to believe that the client used by the API was using a cache was that @matt-royal consistently encountered an issue when testing on a KinD cluster on his M1 Mac where the first GET call to /v3/jobs that the CF CLI makes immediately after the delete request was returning a 404 error which only happens if the deletion timestamp of the specified org is empty. I was not able to reproduce this on my i9 Mac, but @clintyoshimura and I were able to reproduce this on an AKS cluster. I just deployed from the main branch onto a KinD cluster and was finally able to reproduce it locally:

$ cf delete-org o -v; kubectl get -n cf cforg/cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76 -o yaml
...
Deleting org o as cf-admin...
REQUEST: [2023-06-30T13:58:24-07:00]
GET /v3/organizations?names=o HTTP/1.1
Host: localhost
Accept: application/json
Authorization: [PRIVATE DATA HIDDEN]
Content-Type: application/json
User-Agent: cf/8.7.1+9c81242.2023-06-15 (go1.20.5; amd64 darwin)
[application/json Content Hidden]

RESPONSE: [2023-06-30T13:58:24-07:00]
HTTP/1.1 200 OK
Content-Type: application/json
Date: Fri, 30 Jun 2023 20:58:24 GMT
Server: envoy
Vary: Accept-Encoding
X-Correlation-Id: 81578357-659d-44ff-a2db-aedf9a12948f
X-Envoy-Upstream-Service-Time: 9
{
  "pagination": {
    "first": {
      "href": "https://localhost/v3/organizations?names=o"
    },
    "last": {
      "href": "https://localhost/v3/organizations?names=o"
    },
    "next": null,
    "previous": null,
    "total_pages": 1,
    "total_results": 1
  },
  "resources": [
    {
      "created_at": "2023-06-30T20:57:36Z",
      "guid": "cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76",
      "links": {
        "self": {
          "href": "https://localhost/v3/organizations/cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76"
        }
      },
      "metadata": {
        "annotations": {
          "korifi.cloudfoundry.org/creation-version": "v0.7.2-271-ga6fff152"
        },
        "labels": {}
      },
      "name": "o",
      "relationships": {},
      "suspended": false,
      "updated_at": "2023-06-30T20:57:36Z"
    }
  ]
}


REQUEST: [2023-06-30T13:58:24-07:00]
DELETE /v3/organizations/cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76 HTTP/1.1
Host: localhost
Accept: application/json
Authorization: [PRIVATE DATA HIDDEN]
Content-Type: application/json
User-Agent: cf/8.7.1+9c81242.2023-06-15 (go1.20.5; amd64 darwin)
[application/json Content Hidden]

RESPONSE: [2023-06-30T13:58:24-07:00]
HTTP/1.1 202 Accepted
Content-Length: 0
Date: Fri, 30 Jun 2023 20:58:24 GMT
Location: https://localhost/v3/jobs/org.delete~cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76
Server: envoy
X-Correlation-Id: 0b708888-d59e-402e-a072-07511a855f52
X-Envoy-Upstream-Service-Time: 11

REQUEST: [2023-06-30T13:58:24-07:00]
GET /v3/jobs/org.delete~cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76 HTTP/1.1
Host: localhost
Accept: application/json
Authorization: [PRIVATE DATA HIDDEN]
Content-Type: application/json
User-Agent: cf/8.7.1+9c81242.2023-06-15 (go1.20.5; amd64 darwin)
[application/json Content Hidden]

RESPONSE: [2023-06-30T13:58:24-07:00]
HTTP/1.1 404 Not Found
Content-Type: application/json
Date: Fri, 30 Jun 2023 20:58:24 GMT
Server: envoy
Vary: Accept-Encoding
X-Correlation-Id: 6ae99bec-1e6a-47c1-975f-8c86eb0cd0b9
X-Envoy-Upstream-Service-Time: 9
{
  "errors": [
    {
      "code": 10010,
      "detail": "Job not found. Ensure it exists and you have access to it.",
      "title": "CF-ResourceNotFound"
    }
  ]
}


Job not found. Ensure it exists and you have access to it.
FAILED

apiVersion: korifi.cloudfoundry.org/v1alpha1
kind: CFOrg
metadata:
  annotations:
    korifi.cloudfoundry.org/creation-version: v0.7.2-271-ga6fff152
  creationTimestamp: "2023-06-30T20:57:36Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2023-06-30T20:58:24Z"
  finalizers:
  - cfOrg.korifi.cloudfoundry.org
  generation: 2
  name: cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76
  namespace: cf
  resourceVersion: "2297"
  uid: ff51f747-dc2c-4760-bbfb-902e700c4c88
spec:
  displayName: o
status:
  conditions:
  - lastTransitionTime: "2023-06-30T20:57:36Z"
    message: ""
    observedGeneration: 1
    reason: Ready
    status: "True"
    type: Ready
  guid: cf-org-f00abc7c-e09b-4ee6-a9e5-891859f79f76
  observedGeneration: 2

I tried again with a new org with a space and an app pushed to check the API logs. When I deleted the org, I received the same output as above, and saw this in the logs:

{"severity":"INFO","timestamp":"2023-06-30T21:06:05.457200276Z","logger":"handlers.job.get.handleOrgDelete","caller":"errors/errors.go:33","message":"org not marked for deletion","correlation-id":"c48fbf91-69c5-48f0-a4b6-4c94312c8947","OrgGUID":"cf-org-9579b6a8-d39d-427e-a90e-0d3ac1
0f0095","reason":"job \"org.delete~cf-org-9579b6a8-d39d-427e-a90e-0d3ac10f0095\" not found"}

This could be an etcd eventual consistency problem, but it seemed to be happening too consistently for that.

The other issue I ran into while testing the org deletion job was that the API would return a "COMPLETED" status in the response to (normally) the third call, which would make the CLI stop polling and print OK. However, when we retrieved a list of CFOrg resources with kubectl or watched it in another window with k9s, we could still see the org that had been deleted for about a minute afterward. Switching to the dynamic user client resulted in the /v3/jobs endpoint correctly returning a "PROCESSING" status until after the CFOrg resource disappeared from k9s. The strange was that space deletion did not act the same way. Instead, it would return a "PROCESSING" status until the space disappeared from k9s and then return the "COMPLETED" status.

I would avoid introducing a new client unless we can prove that it makes a difference.

I agree that adding the new client feels unnecessary. The strange thing is that the new dynamic client seems to eliminate both of the behaviours I described above when testing on the AKS cluster. I just tried an alternative approach, which is to update the org's delete handler to wait for up to 5 seconds until it can confirm that the deletion timestamp is set. This appears to solve the problem of the first call to /v3/jobs returning a 404 error but does not solve the problem of the third call returning a "COMPLETED" status even though the org is still visible in k9s.

@gcapizzi
Copy link
Contributor

gcapizzi commented Jul 3, 2023

This appears to solve the problem of the first call to /v3/jobs returning a 404 error but does not solve the problem of the third call returning a "COMPLETED" status even though the org is still visible in k9s.

I wouldn't worry about resources still visible in k9s: there's no cache issue that can make things invisible even if they still exist, only make things visible even if they don't exist anymore, so it's probably due to k9s's cache.

If we know for a fact that both clients are uncached, then we shouldn't be calling only one of the two "uncached". We should understand why the dynamic client succeeds where the regular client fails, but it might just be because it's slower. If this is the case, then I'd just add some retries and get rid of the dynamic client altogether.

@davewalter
Copy link
Member Author

Julian and I figured out what was causing the /v3/jobs endpoint to return "COMPLETED" early; it corresponded to the deletion of the role binding in the namespace. We ended up creating a new GetOrgForDeletion function in the repository to allow the job handler to wait for the org to actually be deleted.

We also decided to remove the new user client and go with the retry loop to wait for the deletion timestamp to be set in the org/space.

acosta11 and others added 2 commits July 5, 2023 12:10
- Refactor org deletion helper to handle all types.

[#2604]

Co-authored-by: Dave Walter <[email protected]>
api/handlers/job.go Outdated Show resolved Hide resolved
api/handlers/job.go Outdated Show resolved Hide resolved
api/repositories/org_repository.go Outdated Show resolved Hide resolved
@matt-royal
Copy link
Member

I just confirmed that both cf delete-space and cf delete-org work as expected on this PR branch

- Add a retry loop to wait for the deletion timestamp to propagate.
- Added a GetOrgUnfiltered function to the org repository to avoid
  falsely returning a not found error when the role binding is deleted
  from an org during deletion.
- Consolidate CFSpaceRepository and SpaceRepository interfaces.

[#2604]
[#2605]

Co-authored-by: Dave Walter <[email protected]>
Co-authored-by: Julian Hjortshoj <[email protected]>
@davewalter davewalter merged commit b473e57 into main Jul 6, 2023
6 checks passed
@davewalter davewalter deleted the issues/2604 branch July 6, 2023 23:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants