Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST: aarch64 job is timing out again #16565

Closed
pllim opened this issue Jun 14, 2024 · 21 comments · Fixed by #17024
Closed

TST: aarch64 job is timing out again #16565

pllim opened this issue Jun 14, 2024 · 21 comments · Fixed by #17024

Comments

@pllim
Copy link
Member

pllim commented Jun 14, 2024

Looks like it, again, exceeds 6 hours and got terminated. 😿

Example log: https://github.com/astropy/astropy/actions/runs/9443303697/job/26006816907

@pllim
Copy link
Member Author

pllim commented Sep 12, 2024

Previous incarnation of the same problem:

Previous attempt to bring the run time down (but seems like it cropped back up since):

Thanks to @neutrinoceros for that PR and digging these back out.

@pllim
Copy link
Member Author

pllim commented Sep 12, 2024

@larrybradley asked if we can just not run aarch64 job. I think not because we have outstanding requests for better aarch64 support?

@neutrinoceros
Copy link
Contributor

It may actually be much easier to add aarch64 wheels now than it was 3 years ago when #11731 was open: this is now supported by cibuildwheel. It may not solve the issue of long running tests though, and I imagine we want to keep testing all wheels we publish...
Maybe we should give it a try first, see how long it typically takes, and see if we can manage a way to only test these wheels for nightlies and releases ?

@pllim
Copy link
Member Author

pllim commented Sep 12, 2024

Re: #16565 (comment)

I'll defer to @astropy/astropy-project-release-team

@pllim
Copy link
Member Author

pllim commented Sep 12, 2024

@astrofrog also suggested we split the aarch64 run into smaller chunks across multiple jobs but I am not in favor because:

  • It does not fix the real underlying issue. Why is this one so slow but not the other "exotic" archs in the same setup?
  • We have to break it out of the matrix and manage a more complicated workflow just for this.
  • Does not fix the problem that we have to run the whole thing anyway during wheel testing unless you want to build multiple duplicate wheels or something.

@neutrinoceros
Copy link
Contributor

neutrinoceros commented Sep 13, 2024

Why is this one so slow but not the other "exotic" archs in the same setup?

Could it be a regression in the action we use to emulate this aarch ? I note that the latest version was released on April 11th, roughly 2 weeks before #16405 was opened with the following

(...) it has been such for 2 weeks.

was the culprit right under our nose ? Let's try pinning to the previous version and see if it helps.

update: experiment started on my fork at https://github.com/neutrinoceros/astropy/actions/runs/10844236502/job/30092732422?pr=24

@Cadair
Copy link
Member

Cadair commented Sep 13, 2024

I suggested on slack that we just pay for native aarch64 runners. The cost for a 4 core linux runner is $0.01/minute, so I doubt the total cost would be prohibitive.

@astrofrog and I can run a test build to see how long it would take on native hardware to better estimate the cost.

@saimn
Copy link
Contributor

saimn commented Sep 13, 2024

Looking at timestamps, test collection only takes ~1h20 ... versus ~5min for s390x, and the first test is ran 2h20 after the action start. Not sure what's happening with the aarch64 emulation but doesn't seem usable. It also takes 20 min to build the docker image, which could be avoided by caching it.
So that just a massive waste of resources and splitting in chunks would be worse. If we really want to support this arch I guess paying for native hardware is the way to go.

@astrofrog
Copy link
Member

I've been experimenting with the Linux arm64 GitHub hosted runner, which one needs to pay for. If I run the full test suite with just required deps (basically tox -e test) but including the hypothesis tests, it takes just under 11 minutes - this is on a 2-core runner (can't select 1-core). If I run in parallel with -n 2 the time goes down to 9 minutes. At $0.008/minute, this means each run costs $0.072.

Looking at the last few months, we had the following number of aarch62 runs:

  • September (so far): 65
  • August: 70
  • July: 56

Let's say 100 to be conservative - this means a total cost of $7.2 per month, which seems totally worth it.

The only snag might be that I'm not sure if Astropy can easily have access to this runner, for a couple of reasons:

  • It seems you need a team or enterprise account, and team account is $4 per month per user (which means someone in a team in the org) which would be very expensive for astropy. However, it might be the astropy has a type of account that allows this, but I don't have permissions to check.
  • For some reason, I wasn't able to enable the runner on a public repo, just a private one

@astrofrog
Copy link
Member

astrofrog commented Sep 13, 2024

And for fun, here are all the aarch64 jobs in the last few months:

image

Looks like there was indeed a sudden jump!

@astrofrog
Copy link
Member

The last successful 3 hour run was on 2024-04-25T21:47:35+00:00 and the first 6 hour run happened on 2024-04-26T20:55:55+00:00, so sometime around then something was merged that doubled the CI time!

@astrofrog
Copy link
Member

astrofrog commented Sep 14, 2024

I probably won't have time to look at what was merged in that time range until Monday so if anyone wants to take a stab please feel free to! I think getting back to 3h is clearly the priority if we can.

@Cadair
Copy link
Member

Cadair commented Sep 14, 2024

I am pretty sure astropy and sunpy (etc numfocus) got upgraded to higher github plans for free ages ago so I don't think the upgrade per user thing will be a problem for us. Obviously, the private repo thing needs figuring out.

@neutrinoceros
Copy link
Contributor

so sometime around then something was merged that doubled the CI time!

I would like to point out that the change is not necessarilly something that was merged in astropy.
I think we can reasonably eliminate uraimo/run-on-arch-action from the list of potential suspects (#16565 (comment)) but we haven't ruled out that something changed in the docker image we run tests in. Anyway, what I mean to say is that maybe it's not worth looking for the actual culprit here.

@astrofrog
Copy link
Member

astrofrog commented Sep 14, 2024

Yes indeed - as @Cadair pointed out though, we should just pay for the Linux arm64 runner if we can in any case as the yearly cost (<$100) is less than an hour of developer time and I spent at least this estimating the cost 😅

@Cadair
Copy link
Member

Cadair commented Sep 14, 2024

From poking around in the sunpy org settings pretty sure you can make a "runner group" with specific public repos in. Also you can set spending limits.

@Cadair
Copy link
Member

Cadair commented Sep 15, 2024

I played with setting this up for SunPy: sunpy/streamtracer#167 only real issue was with the reusable workflow

@pllim
Copy link
Member Author

pllim commented Sep 16, 2024

only real issue was with the reusable workflow

Okay, probably not a big deal for us here because currently the "exotic archs" do not use OpenAstronomy stuff, so as long as you don't start now for aarch64, should be okay.

Update from 2025-09-16: Finance Committee approved the funding request for $25 limit per month using this new paid runner. I have enabled it for this repo (and this repo only). @astrofrog will have a PR out soon to use it. 🤞

@larrybradley
Copy link
Member

Any chance that photutils can use this too?
https://github.com/astropy/photutils/actions/runs/10877841286

@pllim
Copy link
Member Author

pllim commented Sep 16, 2024

Re: photutils -- If it is not urgent, I would like to see how much money it really costs first for astropy for a few weeks (or maybe even during feature freeze craziness) before extending it to other repos.

@pllim
Copy link
Member Author

pllim commented Sep 17, 2024

@larrybradley , I opened astropy/photutils#1883 for photutils as a follow-up issue. Thanks for your patience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants