Onboard AWS Scale tests to Boskos #33183

Open
hakuna-matatah opened this issue Jul 31, 2024 · 9 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/testing Categorizes an issue or PR as relevant to SIG Testing.

Comments

@hakuna-matatah
Contributor

What would you like to be added:

  • Onboard the AWS scale test account to Boskos so that the janitor kicks in and cleans up leaked resources.

Why is this needed:
When kubetest2 teardown doesn't fully succeed, it leaks resources, and these are not cleaned up until the next run (24 hours later).
Leaving leaked resources around until the next run is not cost-effective, and they can also break the next run:

W0731 10:50:32.362521   50536 executor.go:141] error running task "AutoscalingGroup/nodes.scale-1000.periodic.test-cncf-aws.k8s.io" (6m1s remaining to succeed): error creating AutoScalingGroup: AutoScalingGroup by this name already exists - A group with the name nodes.scale-1000.periodic.test-cncf-aws.k8s.io already exists and is pending delete.  Please wait until Autoscaling completes the deletion process before creating another group with the same name.

Example run - https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2/1818587581143584768

@hakuna-matatah hakuna-matatah added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 31, 2024
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 31, 2024
@hakuna-matatah hakuna-matatah changed the title Onboard to Boskos Onboard AWS Scale tests to Boskos Jul 31, 2024
@dims
Member

dims commented Jul 31, 2024

we have to sit on this one until we can figure out automation to bump any given AWS account to be able to run 5k tests: limits need to be increased for various services like EC2, which needs explicit bumps worked out with AWS support folks.

@BenTheElder
Member

> we have to sit on this one until we can figure out automation to bump any given AWS account to be able to run 5k tests: limits need to be increased for various services like EC2, which needs explicit bumps worked out with AWS support folks.

I don't think so? You can add the existing account to a new boskos pool?

@BenTheElder
Member

We haven't done this for the GCP 5k project; we just put the one project we have in a dedicated single-project pool, so it can still make use of Boskos's lifecycle features and be rented to multiple jobs.

I also recommend putting any such jobs into a job queue that matches the Boskos pool; for too long we have relied only on manual scheduling.

(`job_queue_name` and `job_queue_capacities`; not very well documented at the moment, but see the `job_queue_capacities:` config key and https://pkg.go.dev/sigs.k8s.io/prow/pkg/config)
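
As a rough illustration of the job-queue suggestion above, the fragment below sketches how a queue capacity and a job's queue membership might be declared. The queue name `aws-scale` and the capacity of 1 are assumptions chosen for illustration and do not come from the actual config. The periodic is trimmed to the relevant field, so this is a fragment rather than a complete, loadable job definition.

```yaml
# Central Prow config: cap how many jobs from a named queue may run at once.
plank:
  job_queue_capacities:
    aws-scale: 1   # hypothetical queue name; 1 = one job at a time

# Job config (trimmed): assign the scale job to that queue.
periodics:
- name: ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2
  job_queue_name: aws-scale   # must match a key under job_queue_capacities
  # schedule, spec, and other required fields omitted
```

With a capacity that matches the number of accounts in the Boskos pool, Prow would hold back additional jobs rather than letting them start and wait on a Boskos lease.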

@BenTheElder
Member

/sig scalability k8s-infra testing

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 5, 2024
@ameukam
Member

ameukam commented Aug 5, 2024

@dims I think it should be OK to do this, since we are moving the existing scale account under its own Boskos resource type:

https://github.com/kubernetes/k8s.io/blob/58752c8ab30f15166a1a0228f7fb461dcf0fab2b/infra/gcp/terraform/k8s-infra-prow-build/prow-build/resources/test-pods/boskos-resources-configmap.yaml#L215C1-L218C38

@dims
Member

dims commented Aug 5, 2024

@ameukam cool! that sounds better :) having a separate type and then making sure we use that. all i was worried about was that we can't just pick a random account and run a scale test on it

@BenTheElder
Member

Yeah, we definitely don't want to make the entire main pool scale test ready.

We actually have a few pools on GCP like this, e.g. the GPU projects are also special and we're not setting up that quota for every project. We actually even have a secondary scale pool with a few projects for smaller scalability jobs.

I think we can mimic this: we just need to add a pool definition with the AWS account, make sure the janitor is enabled for that pool, and switch the job to reference the pool. If we roll that out between scheduled runs, it should just work without disruptions.
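
A minimal sketch of what such a dedicated single-account pool could look like in the Boskos resources config, assuming the static `type` / `state` / `names` layout. The resource type name and the account placeholder are hypothetical and are not taken from the linked configmap. Starting the resource in the `dirty` state is also an assumption, intended to have the janitor sweep the account before its first lease.

```yaml
resources:
- type: aws-scale-account          # hypothetical type name for the dedicated pool
  state: dirty                     # assumption: force a janitor sweep before first use
  names:
  - "<existing-aws-scale-account>" # placeholder for the existing scale-test account
```

The job would then request this resource type instead of the main AWS pool's type, so it can never be handed an account that lacks the raised limits.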

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 3, 2024
@BenTheElder
Member

/lifecycle frozen
we will need this eventually to streamline managing accounts etc

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 18, 2024