
Severely degraded performance of macos-14-xlarge runners #10098

Open
2 of 14 tasks
shagedorn opened this issue Jun 20, 2024 · 17 comments
Labels: Area: Image administration, bug report, investigate, OS: macOS

@shagedorn

Description

Starting yesterday (June 19th) around 2pm CEST, we have seen severely degraded performance in all of our macos-14-xlarge-runner-based workflows: each step takes roughly twice as long, leading to timeouts and increased cost (especially since these are the most expensive runners).

We see this ~2x degradation across every step, from checkouts (the standard actions/checkout@v4 step) to Ruby setup to Xcode builds and test executions. The plain "Set up job" step in one case took 21s, when it is usually a 1s step.

We have had some runs since the issue started that performed normally, but they are rare… most are extremely slow, which means our CI is essentially down. We are puzzled that all status reports are green, because the degradation shows up in completely standard actions (like checkout), so we are fairly sure this is unrelated to our own setup or code.

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • Ubuntu 24.04
  • macOS 11
  • macOS 12
  • macOS 13
  • macOS 13 Arm64
  • macOS 14
  • macOS 14 Arm64
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

20240611.1

Is it regression?

Yes, but the same image version seemed to perform ok until it did not…

Expected behavior

Predictable, stable performance across runs.

Actual behavior

Unpredictable and severely degraded performance since yesterday in most runs.

Repro steps

See description.

erik-bershel added the OS: macOS, Area: Image administration, and investigate labels and removed the needs triage label on Jun 20, 2024
@erik-bershel (Contributor)

Hey @shagedorn!

Please provide some examples (build links would be okay).

@shagedorn (Author)

Our repos are private, so I have no such links (or I assume they are not useful – are they?).

An example job definition, just stripped of some comments:

name: Deploy Internal

on:
  workflow_dispatch:
  workflow_call:
    inputs:
      bundle-version:
        required: true
        type: string

jobs:

  build_and_upload_internal_build:
    runs-on: macos-14-xlarge
    timeout-minutes: 20

    steps:
    
    - name: Checkout
      uses: actions/checkout@v4

    … other steps

This alone already shows the regression described above, but if desired, I can also include/share a simple Ruby setup step in full which shows the regression very clearly for us.
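For illustration, a Ruby setup step of the kind mentioned here typically looks something like the following (a minimal sketch using the ruby/setup-ruby action; the Ruby version shown is hypothetical, not necessarily what our workflow pins):

    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        ruby-version: '3.2'   # illustrative pin; the real workflow uses its own version
        bundler-cache: true   # runs 'bundle install' and caches gems between runs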

@erik-bershel (Contributor)

@shagedorn, private links are fine to share. We don't have access to what's going on in the private repository, but we can still extract some useful technical information from them.
To understand the situation, we would like to collect as much information as possible: not only how it works now, but also how it worked previously, for comparison. For that, we need examples of past runs of the same workflow.

@shagedorn (Author)

I see 🙂 So I picked 3 random runs from the last few days before the issue occurred. All of them use runner image 20240611.1.

https://github.com/biowink/clue-ios-rebirth/actions/runs/9565398661/job/26368249268
https://github.com/biowink/clue-ios-rebirth/actions/runs/9543717870/job/26300856571
https://github.com/biowink/clue-ios-rebirth/actions/runs/9515963801/job/26231278292

I consider all of them healthy: "Set up job" took around 1s, and our Ruby setup was in the ~25s range. They finished in ~13 minutes.

Here are 3 recent runs that timed out after 20 minutes:

https://github.com/biowink/clue-ios-rebirth/actions/runs/9582869833/job/26422805988
https://github.com/biowink/clue-ios-rebirth/actions/runs/9583761389/job/26425770660
https://github.com/biowink/clue-ios-rebirth/actions/runs/9584509128/job/26428290330

The "Set up job" is 2-3s here, and the Ruby setup is 44-50s. And then things get worse in the more substantial steps within Xcode.

We have not made any workflow changes in between, and we pin all our tools (incl. Xcode) to specific versions and have not made any upgrades.
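For context, pinning the Xcode version on these runners is typically done with a step along these lines (an illustrative sketch; the version and path are examples, not necessarily the ones we use):

    - name: Select Xcode
      run: sudo xcode-select -s /Applications/Xcode_15.4.app   # example version; preinstalled Xcodes live under /Applications on the runner image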

Thank you for looking into this.

@icecoffin commented Jun 20, 2024

We're experiencing the same issue, with almost a 2x increase in runtime. This is similar to what we saw back when we were using standard runners, so I wonder whether the jobs are actually running on standard runners instead of larger ones (although the "Set up job" sections of the runs look the same).

One thing I noticed: if I go to the Runners page in the repository (https://github.com/<organization>/<repository>/actions/runners), it shows "1 available runner" and "Unprovisioned" next to "Larger GitHub-hosted runners." I don't know if this is how it's supposed to be, and I don't know how this page looked before. I was told that there were no organization-wide settings changes that could've affected this.

Screenshot 2024-06-20 at 10 28 04

@AleBorini commented Jun 20, 2024

We are in exactly the same position here. The performance of all pipelines running on the latest versions of macOS is completely messed up.

This is the run time of an old pipeline run from when things were still working:

Screenshot 2024-06-20 at 10 19 52

Compared to the current situation:
Screenshot 2024-06-20 at 10 24 49

On top of the poor performance, I tried reverting all the latest commits to main back to the last successful run and stripped all caching from the pipeline to rule out false negatives.

Any idea whether it's possible to select a specific runner image version to run the workflow on? In its current state, the macOS runners are not usable on our side.

Any pro tips are welcome!

Old runs:
https://github.com/LEGO/fabuland/actions/runs/9548243121

Since yesterday:
https://github.com/LEGO/fabuland/actions/runs/9578611013

@erik-bershel (Contributor)

Hey @AleBorini!

You wrote: "The performance of all pipelines running on the latest versions of macOS is completely messed up."

Does that mean you see performance degradation on Standard Runners too? Since the same time? After the image update, or just from some point in time?

@AleBorini commented Jun 20, 2024

Replying to @erik-bershel's question above ("Does that mean you see performance degradation on Standard Runners too? Since the same time? After the image update, or just from some point in time?"):

I just built this version of the same pipeline to verify what is what => https://github.com/LEGO/fabuland/actions/runs/9596143061/job/26462381554

It will run the workflow against every available macOS image stored in the matrix, from macos-12 to macos-latest-xlarge (a sketch of the matrix follows below). I expect 12 and 13 to fail because of the Ruby setup step, but other than that I should get a clearer view of which images are not running as they used to.
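Roughly, such a matrix can be expressed like this (job name and runner labels are illustrative; the actual workflow may list a different set of images and steps):

jobs:
  compare_runners:
    strategy:
      fail-fast: false   # let every image finish so timings can be compared
      matrix:
        runner: [macos-12, macos-13, macos-13-xlarge, macos-14, macos-14-xlarge, macos-latest-xlarge]
    runs-on: ${{ matrix.runner }}
    timeout-minutes: 30
    steps:
    - name: Checkout
      uses: actions/checkout@v4
    … yarn install, pod install, and the remaining steps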

@AleBorini commented Jun 20, 2024

Replying again to @erik-bershel's question above:

To follow up on the issue: as mentioned above, I built a pipeline that runs the same steps on different macOS versions, and the results are all over the place.

The macOS 14 runner looks like the only usable one at the moment. Even simple tasks like yarn install or pod install take double the time on the macOS extra-large runners.

I can confirm this has been happening since yesterday morning, UK time.

Screenshot 2024-06-20 at 15 12 28

This is the link to the run; I will keep the PR alive for a while => https://github.com/LEGO/fabuland/actions/runs/9596143061

erik-bershel self-assigned this on Jun 20, 2024
@mr-v commented Jun 21, 2024

@erik-bershel I shared details in support ticket 2849698.

@TomaszLizer

I have observed a similar gradual degradation of runner performance.
Runner used in workflow: macos-14-xlarge

I noticed it yesterday, when different repos started timing out.
Screenshot 2024-06-21 at 14 02 24

@Vyazovoy

We're experiencing the same issue. Over the last two weeks, our workflow time has climbed from ~24 minutes to ~56 minutes.

@AleBorini commented Jun 21, 2024

I'm still investigating on our side, but our release pipelines are affected as well. The run time has doubled on macOS 14 extra large.

Screenshot 2024-06-21 at 15 58 08

The inconsistency is the real issue: simple commands like yarn install or pod install take forever, with wildly varying results.

@shagedorn (Author)

@erik-bershel can you please leave a status update? I received a comment update email Friday night, but it seems that comment has since been deleted. I started seeing normal runtimes again this morning, but the lack of any updates, either here or in the form of an incident on https://www.githubstatus.com, doesn't exactly create confidence.

This was a pretty large outage, which you (= GitHub), according to your since-deleted comment, saw in your own monitoring. Why does it take so long to acknowledge? It seems that all affected users either had unusable CI systems or essentially paid double (sometimes more) for a degraded service, for 1-2 days.

@erik-bershel (Contributor)

Hey @shagedorn!

I understand your concern - this is a very important point for users. 💯💯💯
I am currently awaiting information from colleagues; as soon as any details appear, we will share them with you immediately.

@AleBorini

I can confirm that run times seem to have improved since Friday. I'm running more tests on my side to check whether all our macOS pipelines are back to normal.

This is an example of our QA release build:
Screenshot 2024-06-24 at 10 22 21

@sarahbarili

During the past week, certain customers using macOS runners may have noticed decreased performance when running Actions workflows. The issue arose from the simultaneous rollout of a macOS update and a network driver update. Upon identifying the affected configurations, we reverted the network driver to its previous version, and we will take care to avoid such conflicts in future updates.
