too many cloud api calls in node-update-controller #442

Open
yussufsh opened this issue Aug 24, 2023 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@yussufsh
Contributor

/kind bug
/kind enhancement

What happened?
There are lots of cloud API calls in the node-update-controller: it repeatedly creates the PowerVS cloud object, and some of those calls fail.

Within a single minute there are a total of ~13 calls that create a cloud object and then GET the PVM instance to check and set the storage affinity policy.

# oc logs ibm-powervs-block-csi-driver-controller-86f4c6459-gxn8f -c node-update-controller --previous | grep 'I0821 05:21' | wc -l
27

See the examples below, where a few errors occur while fetching the PVM instance. The last one occurs while getting the PowerVS client object; it is fatal and causes the container to restart (see #441).

Examples:

# oc logs ibm-powervs-block-csi-driver-controller-86f4c6459-gxn8f -c node-update-controller --previous | grep -v 'StoragePoolAffinity' | grep -v 'PROVIDER-ID'
2023-08-19T02:42:36Z    INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8081"}
2023-08-19T02:42:36Z    INFO    setup   starting manager
2023-08-19T02:42:36Z    INFO    Starting server {"kind": "health probe", "addr": "[::]:8082"}
2023-08-19T02:42:36Z    INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8081"}
2023-08-19T02:42:36Z    INFO    Starting EventSource    {"controller": "node", "controllerGroup": "", "controllerKind": "Node", "source": "kind source: *v1.Node"}
2023-08-19T02:42:36Z    INFO    Starting Controller     {"controller": "node", "controllerGroup": "", "controllerKind": "Node"}
2023-08-19T02:42:36Z    INFO    Starting workers        {"controller": "node", "controllerGroup": "", "controllerKind": "Node", "worker count": 1}
I0819 05:54:42.543016       1 nodeupdate_controller.go:81] Unable to fetch Instance Details failed to Get PVM Instance 36776ce2-ef10-400b-be7d-c9511d00f01b :[GET /pcloud/v1/cloud-instances/{cloud_instance_id}/pvm-instances/{pvm_instance_id}][500] pcloudPvminstancesGetInternalServerError  &{Code:0 Description:pvm-instance 36776ce2-ef10-400b-be7d-c9511d00f01b in cloud-instance f4d71e5f9bea49f9a6fdae6f38c4b2cb error: failed to get server and update cache: timed out of retrieving resource for pvmInstanceServer:lon06:f4d71e5f9bea49f9a6fdae6f38c4b2cb:36776ce2-ef10-400b-be7d-c9511d00f01b Error:internal server error Message:}
I0820 06:54:24.914454       1 nodeupdate_controller.go:81] Unable to fetch Instance Details failed to Get PVM Instance 36776ce2-ef10-400b-be7d-c9511d00f01b :[GET /pcloud/v1/cloud-instances/{cloud_instance_id}/pvm-instances/{pvm_instance_id}][500] pcloudPvminstancesGetInternalServerError  &{Code:0 Description:pvm-instance 36776ce2-ef10-400b-be7d-c9511d00f01b in cloud-instance f4d71e5f9bea49f9a6fdae6f38c4b2cb error: failed to get server and update cache: timed out of retrieving resource for pvmInstanceServer:lon06:f4d71e5f9bea49f9a6fdae6f38c4b2cb:36776ce2-ef10-400b-be7d-c9511d00f01b Error:internal server error Message:}
I0820 17:30:31.360402       1 nodeupdate_controller.go:81] Unable to fetch Instance Details failed to Get PVM Instance 36776ce2-ef10-400b-be7d-c9511d00f01b :[GET /pcloud/v1/cloud-instances/{cloud_instance_id}/pvm-instances/{pvm_instance_id}][403] pcloudPvminstancesGetForbidden  &{Code:403 Description: Error: Message:user iam-ServiceId-c27c3ef5-8405-4dc1-9590-4440adaad19f does not have correct permissions to access crn:v1:bluemix:public:power-iaas:lon06:a/bf9f1f230466481b95a99f18739fede9:dbc67d5e-9579-49da-b1d9-fc2ec7ddc680:: with {role:user-unauthorized permissions (read:false write:false manage:false)}}
F0821 05:22:32.216618       1 powervs_node.go:69] Failed to get powervs cloud: errored while getting the Power VS service instance with ID: dbc67d5e-9579-49da-b1d9-fc2ec7ddc680, err: Get "https://resource-controller.cloud.ibm.com/v2/resource_instances/dbc67d5e-9579-49da-b1d9-fc2ec7ddc680": read tcp 192.168.81.10:46226->104.102.54.251:443: read: connection reset by peer

What you expected to happen?
The node-update-controller should not make so many cloud API calls.
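
For illustration only (this is not the driver's code; Cloud and NewPowerVSCloud are assumed stand-in names), one way to cut the repeated client creation would be to build the PowerVS cloud object once and reuse it across reconciles:

// Sketch only: build the PowerVS cloud object once and reuse it, instead of
// creating a new one on every reconcile. Cloud/NewPowerVSCloud are assumed
// stand-ins for the driver's real types and constructor.
package cloudcache

import "sync"

type Cloud struct {
	// wraps the authenticated PowerVS session and clients
}

// NewPowerVSCloud stands in for the existing (expensive) constructor that
// authenticates and looks up the service instance.
func NewPowerVSCloud(serviceInstanceID string) (*Cloud, error) {
	// ... authenticate, resolve the service instance, build clients ...
	return &Cloud{}, nil
}

var (
	once     sync.Once
	cached   *Cloud
	cacheErr error
)

// GetCloud returns a shared client; only the first call talks to the cloud APIs.
func GetCloud(serviceInstanceID string) (*Cloud, error) {
	once.Do(func() {
		cached, cacheErr = NewPowerVSCloud(serviceInstanceID)
	})
	return cached, cacheErr
}

In practice the cached client would still need token refresh or a TTL; the sketch only shows the reuse idea.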

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version):
  • Driver version: latest
@k8s-ci-robot
Contributor

@yussufsh: The label(s) kind/enhancement cannot be applied, because the repository doesn't have them.

In response to this:

/kind bug
/kind enhancement

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 24, 2023
@yussufsh
Contributor Author

/assign @yussufsh
One solution could be to add a node label as soon as we set the Storage Affinity Policy to false on the PVM instance. Subsequent reconcile calls should check whether the node already has that label; if it does, there is no need to call the cloud APIs.
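
A rough sketch of that idea (the label key, the reconciler fields, and the disableStorageAffinity helper are assumptions for illustration, not the driver's actual code):

// Sketch of a label-based short-circuit in the node reconciler.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// assumed label key recording that the affinity policy was already disabled
const affinityPolicySetLabel = "powervs.csi.ibm.com/storage-affinity-policy-set"

type NodeUpdateReconciler struct {
	client.Client
}

func (r *NodeUpdateReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	node := &corev1.Node{}
	if err := r.Get(ctx, req.NamespacedName, node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// If the node was already handled, skip the cloud APIs entirely.
	if _, done := node.Labels[affinityPolicySetLabel]; done {
		return ctrl.Result{}, nil
	}

	// Otherwise run the existing logic: GET the PVM instance and set
	// StoragePoolAffinity to false (assumed helper standing in for it).
	if err := r.disableStorageAffinity(ctx, node); err != nil {
		return ctrl.Result{}, err
	}

	// Record completion on the node so later reconciles stay local.
	patched := node.DeepCopy()
	if patched.Labels == nil {
		patched.Labels = map[string]string{}
	}
	patched.Labels[affinityPolicySetLabel] = "true"
	return ctrl.Result{}, r.Patch(ctx, patched, client.MergeFrom(node))
}

// disableStorageAffinity stands in for the current cloud calls.
func (r *NodeUpdateReconciler) disableStorageAffinity(ctx context.Context, node *corev1.Node) error {
	// ... create/reuse the PowerVS client, fetch the PVM instance, update it ...
	return nil
}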

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2024
@yussufsh
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2024
@yussufsh
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2024
@yussufsh
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 25, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2024
@yussufsh
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2024