
Broken C++ runtime on windows-2022 version 20240603.1.0 #10020

Open

lelegard opened this issue Jun 8, 2024 · 26 comments

@lelegard commented Jun 8, 2024

Description

Something was broken in runner image windows-2022 version 20240603.1.0, maybe in the VC++ runtime.

When running test programs in a workflow, using a C++ mutex (std::mutex) immediately terminates the application with error 1.

As a consequence, all continuous integration pipelines and non-regression test suites for C++ applications are broken and unusable as soon as the application uses a mutex.

This is a very serious issue which should be addressed with a high priority.

The problem appeared after the upgrade of windows-2022 (aka "windows-latest"):

  • from runner version 2.316.1, image version 20240514.3.0
  • to runner version 2.317.0, image version 20240603.1.0

Platforms affected

  • [ ] Azure DevOps
  • [x] GitHub Actions - Standard Runners
  • [ ] GitHub Actions - Larger Runners

Runner images affected

  • [ ] Ubuntu 20.04
  • [ ] Ubuntu 22.04
  • [ ] Ubuntu 24.04
  • [ ] macOS 11
  • [ ] macOS 12
  • [ ] macOS 13
  • [ ] macOS 13 Arm64
  • [ ] macOS 14
  • [ ] macOS 14 Arm64
  • [ ] Windows Server 2019
  • [x] Windows Server 2022

Image version and build link

The problem is demonstrated in the following simple repo: https://github.com/lelegard/gh-runner-lock

The log with the sample failure: https://github.com/lelegard/gh-runner-lock/actions/runs/9431982526/job/25981214253

Is it regression?

Last worked on windows-2022 runner version 2.316.1, image version 20240514.3.0

Expected behavior

The C++ applications which are built as part of a workflow should not crash during the subsequent test phase.

Actual behavior

See repro section.

Repro steps

The problem is demonstrated in the following simple repo: https://github.com/lelegard/gh-runner-lock

The log with the sample failure: https://github.com/lelegard/gh-runner-lock/actions/runs/9431982526/job/25981214253

The C++ program is quite simple:

#include <mutex>
#include <iostream>
#include <cstdlib> // for EXIT_SUCCESS
int main()
{
    std::cout << "1" << std::flush << std::endl;
    std::mutex m;
    std::cout << "2" << std::flush << std::endl;
    std::lock_guard<std::mutex> lock(m);
    std::cout << "3" << std::flush << std::endl;
    return EXIT_SUCCESS;
}

Of course, being so simple, this program works well everywhere, including on local Windows development systems.

When executed in a GitHub workflow, starting with Windows runner version 2.317.0, it fails in the lock step. The above-mentioned log contains this:

1
2
Error: Process completed with exit code 1.

Background

The problem initially appeared after that upgrade on the TSDuck project, where all workflows suddenly failed on Windows platforms.

The project is quite big (~ 350,000 lines of C++). Everything works fine on local Windows development systems; only the GitHub CI failed. Because of the size of the project and the absence of direct interaction with the GitHub runner, identifying the reason for the failure was quite hard. I spent hours on test repos and ran 52 versions of the CI workflow to understand the nature of the problem.

@grafikrobot

We worked around the problem by switching to the static MSVC runtime (bfgroup/b2@0075644). We plan to stay on the static runtime to avoid hitting this problem again and again, as it's not the first time such botched updates have happened.
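A minimal sketch of enforcing that choice at compile time (relying only on MSVC's documented _DLL predefined macro, which is set for /MD and /MDd builds; the switch itself lives in the build system, e.g. b2's runtime-link=static feature):

// Sketch: fail the build unless the static MSVC runtime was selected.
// MSVC predefines _DLL only when compiling against the shared runtime
// (/MD or /MDd), so its absence confirms /MT or /MTd.
#ifdef _DLL
#error "Shared MSVC runtime (/MD) selected; rebuild with /MT to embed the runtime."
#endif

int main() { return 0; }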

@rzblue

rzblue commented Jun 9, 2024

Looks like the same as #10004; there are some workarounds in that issue.

@lelegard
Author

lelegard commented Jun 9, 2024

@grafikrobot

We worked around the problem by switching to the static MSVC runtime

Glad it works for you. However, I have many DLLs and executables in the project. We can't link with the static runtime: if we did, each process would end up with as many instances of the runtime as DLLs (+1 for the .exe), which does not work.

@rzblue

Looks like the same as #10004; there are some workarounds in that issue.

Thanks for pointing this out. That issue is very recent too; it was opened while I was chasing the reason for the problem (it took a long time to identify the mutex issue).

@mmomtchev

Glad it works for you. However, I have many DLLs and executables in the project. We can't link with the static runtime: if we did, each process would end up with as many instances of the runtime as DLLs (+1 for the .exe), which does not work.

In fact, unless you have total control over your users' machines, you should be grateful to GitHub for showing you the problem. The expression "Windows DLL hell" exists for a reason, and the only viable solution when shipping binaries for Windows is to either go static or ship this DLL yourself. As for me, this is the last time I get burned by the MSVC runtime.

@paulhiggs

There have definitely been some changes in the 'default construction' of a mutex. Extracts from various recent versions...

14.32.31326

    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }

14.39.35519

class _Mutex_base { // base class for all mutex types
public:
#ifdef _ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR
    constexpr _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_storage._Critical_section = {};
        _Mtx_storage._Thread_id        = -1;
        _Mtx_storage._Type             = _Flags | _Mtx_try;
        _Mtx_storage._Count            = 0;
    }
#else // ^^^ _ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR / !_ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR vvv
    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }
#endif // ^^^ !_ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR ^^^

14.40.33807 - this seems to be the latest

#ifdef _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR
    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }
#else // ^^^ defined(_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR) / !defined(_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR) vvv
    constexpr _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_storage._Critical_section = {};
        _Mtx_storage._Thread_id        = -1;
        _Mtx_storage._Type             = _Flags | _Mtx_try;
        _Mtx_storage._Count            = 0;
    }
#endif // ^^^ !defined(_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR) ^^^

In 14.32 and 14.39, the default behaviour (when no compiler directive is specified) is

    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }

Not providing the _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR directive when using 14.40 gives you a different (constexpr) constructor.
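A minimal sketch of the workaround (in a real project the macro would normally be passed as a project-wide preprocessor definition on the compiler command line, so that every translation unit agrees; defining it in source is shown here only for brevity):

// Workaround sketch: opt out of the constexpr mutex constructor so binaries
// built with the 14.40 headers stay compatible with an older msvcp140.dll.
// The macro must be defined before any standard header is included.
#define _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR
#include <mutex>

int main() {
    std::mutex m;                        // back to _Mtx_init_in_situ()
    std::lock_guard<std::mutex> lock(m); // no longer aborts on the old runtime
    return 0;
}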

@lelegard
Author

lelegard commented Jun 9, 2024

@mmomtchev

It's worse than that. I tried to install the same VC runtime as used during the build and the problem remains the same; see #10004 (comment).

@mmomtchev

This is normal: if you build with this version of MSVC, you need the new runtime.

@lelegard
Author

lelegard commented Jun 9, 2024

@mmomtchev,

This is normal, if you build with this version of MSVC, you need the new runtime.

Precisely: I explicitly install on the runner system the MSVC runtime with which I built the code, so the same runtime is used at build time and at run time. But it does not work; running the application still fails on locking the mutex.

There is an inconsistency somewhere. Either the VCRedist package in the VS tree is not the same as the one used by the compiler, or an already installed MSVC runtime takes precedence because of PATH settings. So this is either an inconsistency in VS or an inconsistency in the GH runner.

@mmomtchev

MSVC does not build with the runtime. MSVC produces code that expects to find its runtime. It would be the same with gcc on Linux: install an older shared library runtime, build with a new compiler that uses something that does not exist in this older runtime, and it won't work. The difference is that on Linux these days you will get an error saying the dynamic linker cannot find its symbols; on Windows you get a crash. Microsoft should definitely find a solution to this problem. The Linux solution was also retrofitted much later than the original design.

@lelegard
Author

lelegard commented Jun 9, 2024

@mmomtchev

MSVC does not build with the runtime. MSVC produces code that expects to find its runtime.

You misinterpreted what I meant. The compiler and the RTL work together, always. The compiler (well, the compilation environment at large) provides header files. These header files contain specific definitions, here the variants of the mutex constructor. The binary of the runtime must be compatible with these definitions. If a new version of the compiler (and compilation environment) introduces a new version of a constructor, the corresponding code must be in the RTL. Therefore, there is a new version of the RTL which comes (somehow) with the compiler.

When you install Visual Studio, the installed tree of files contains a VCRedist package, a package to install on target systems to make sure that they will be able to run applications which are compiled by this compiler.

This is what I mean: when you build with a given version of Visual Studio, the headers which are used during the compilation must be compatible with the provided VCRedist package. This is why, when you package an application, you typically include this VCRedist in the package of your application and install both at the same time. Thus, you can be confident that your application will work on the target system, even if it is a bit older (up to a point).

This is why, in my test, I explicitly installed the VCRedist that is found in the Visual Studio setup of the GH runner, before running the application (before compiling in fact). The expected result is that the application which is built is compatible with the RTL that we just installed.

And this is what fails... Therefore, something is rotten in the state of GitHub.
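A minimal diagnostic sketch (relying only on MSVC's documented predefined macros) that prints the toolset and STL versions a binary was compiled against, to compare with the msvcp140.dll actually present on the runner:

// Diagnostic sketch: print the versions this binary was built against.
// _MSC_VER 1940 corresponds to toolset 14.40; _MSVC_STL_UPDATE is the
// STL release in YYYYMM form.
#include <iostream>

int main() {
#ifdef _MSC_VER
    std::cout << "_MSC_VER:         " << _MSC_VER << "\n";
#endif
#ifdef _MSVC_STL_UPDATE
    std::cout << "_MSVC_STL_UPDATE: " << _MSVC_STL_UPDATE << "\n";
#endif
    return 0;
}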

@mmomtchev

You misinterpreted what I meant.

Yes, indeed. I agree: they updated the compiler without updating the VCRedist package. However, this is also a huge wake-up call for everyone who ships Windows binaries. In my case, it is a Node.js addon that is installed through npm and does not have an installer. In those cases, the only viable solution is /MT.

@RaviAkshintala

@lelegard We are looking into the issue and will get back to you after investigating.

sgilmore10 added a commit to apache/arrow that referenced this issue Jun 12, 2024
… crash on `windows-2022` after MSVC update from 14.39.33519 to 14.40.33807 (#42123)

### Rationale for this change

After the `windows-2022` GitHub runner image was updated last week, MATLAB began crashing when running the unit tests in `arrow/matlab/test/tfeather.m` on Windows. As part of the update, VS 2022 was updated from 
`17.9.34902.65` to `17.10.34928.147` and MSVC was updated from `14.39.33519` to `14.40.33807`. 

It looks like many other projects have run into this issue as well:

1. actions/runner-images#10004
2. actions/runner-images#10020

The suggested workaround for this crash is to supply the flag `_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR` when building.

### What changes are included in this PR?

1. Supply the `_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR` flag when building Arrow C++.

### Are these changes tested?

N/A. Existing tests used.

### Are there any user-facing changes?

No.
* GitHub Issue: #42015

Authored-by: Sarah Gilmore <[email protected]>
Signed-off-by: Sarah Gilmore <[email protected]>
@RaviAkshintala removed the awaiting-deployment label (Code complete; awaiting deployment and/or deployment in progress) Jun 13, 2024
@RaviAkshintala

@lelegard

The deployment has completed; could you please try to rerun the workflow?

Image: windows-2022
Version: 20240610.1.0
Included Software: https://github.com/actions/runner-images/blob/win22/20240610.1/images/windows/Windows2022-Readme.md
Image Release: https://github.com/actions/runner-images/releases/tag/win22%2F20240610.1

If you have any issues, please reach out to us.

lelegard added a commit to tsduck/tsduck that referenced this issue Jun 14, 2024
The problem started with GitHub runner version 2.317.0, image version 20240603.1.0.
Said to be fixed now.
See actions/runner-images#10020
@lelegard
Author

@RaviAkshintala

Thanks for the update. It works "a bit better" but all JNI applications (Java Native Interface) are still crashing. So, no, the updated runner is not acceptable.

Let me explain: in my project, I have C++ DLLs and C++ executables. There are also Python and Java bindings. All C++ applications now work correctly. Same for Python applications (the Python interpreter successfully loads my DLL when a Python application calls my Python bindings).

However, when a Java application calls the Java bindings, loading the DLL fails with this error:

Exception in thread "main" java.lang.UnsatisfiedLinkError: D:\a\tsduck\tsduck\bin\Release-x64\tsduck.dll:
A dynamic link library (DLL) initialization routine failed

Log here: https://github.com/tsduck/tsduck/actions/runs/9518023948/job/26237929549

That is exactly the same symptom as with the previous update: the initialization routines of the DLL load resources using non-thread-safe system functions and therefore use a std::mutex. That is what fails for everyone.
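A hypothetical sketch of that pattern (the structure and names are invented for illustration, not taken from TSDuck):

// Hypothetical sketch: a DLL whose static initialization serializes access
// to non-thread-safe system functions with a std::mutex. Built with the
// 14.40 headers but run against an older msvcp140.dll (here, one loaded by
// the JVM), the lock aborts and the DLL initialization routine fails.
#include <mutex>

namespace {
    std::mutex g_init_mutex; // constexpr-constructed with 14.40 headers

    struct Registry {
        Registry() {
            std::lock_guard<std::mutex> lock(g_init_mutex); // aborts on the old runtime
            // ... call non-thread-safe system functions to load resources ...
        }
    };

    Registry g_registry; // runs during DLL initialization
}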

Note that the crash occurs only with 64-bit applications. The 32-bit version works with Java (probably not the same mixture of VC runtime DLLs).

I re-enabled the workaround I implemented earlier, defining _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR, and everything works again, Java / JNI applications included.

So, please consider adding JNI test cases in your validation suites.

@MarkCallow

MarkCallow commented Jun 15, 2024

The expression "Windows DLL hell" exists for a reason, and the only viable solution when shipping binaries for Windows is to either go static or ship this DLL yourself. As for me, this is the last time I get burned by the MSVC runtime.

This is not foolproof, and it is the reason for the continuing Java crashes. JNI modules are built with the latest VC++ and need the latest runtime, but Java Temurin contains its own older version of the vcruntime, which is loaded by the JVM. When the JVM loads a JNI module, it links the module with the vcruntime it has already loaded. When the module attempts to create a mutex, it calls the code in the older vcruntime and the JVM crashes. The workaround is to remove the vcruntime from the Temurin installation.
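A minimal Windows-only diagnostic sketch (standard Win32 calls; the bundled-copy scenario above is what it is meant to reveal) that prints where msvcp140.dll, the C++ runtime DLL containing the std::mutex machinery, was actually loaded from:

// Diagnostic sketch (Windows-only): print the path of the msvcp140.dll that
// this process actually loaded. If it points into the JVM's directory rather
// than System32, a bundled copy is shadowing the system runtime.
#include <windows.h>
#include <iostream>

int main() {
    HMODULE mod = GetModuleHandleW(L"msvcp140.dll");
    if (mod == nullptr) {
        std::cout << "msvcp140.dll not loaded (static runtime build?)\n";
        return 0;
    }
    wchar_t path[MAX_PATH];
    if (GetModuleFileNameW(mod, path, MAX_PATH) != 0) {
        std::wcout << L"msvcp140.dll loaded from: " << path << L"\n";
    }
    return 0;
}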

@RaviAkshintala

RaviAkshintala commented Jun 19, 2024

@lelegard

So, please consider adding JNI test cases in your validation suites.

Thanks for your confirmation, we are closing the issue as completed.

@lelegard
Author

@RaviAkshintala, so you say that you "close this issue as completed" while you "look into the issue".
Seriously? Are you kidding?

@MarkCallow

Thanks for your confirmation, we are closing the issue as completed.

All we've confirmed is that the runner image is still broken. I am therefore in total agreement with @lelegard.

@RaviAkshintala, so you say that you "close this issue as completed" while you "look into the issue".
Seriously? Are you kidding?

@mprather

@lelegard

Thanks for your confirmation, we are closing the issue as completed.

@RaviAkshintala, I don't understand. The confirmation indicated that the updated image is still unreliable and does not offer a stable, reliable build platform. It was clearly not a confirmation that everything is working once again. Why is this closed?

@lelegard
Author

@RaviAkshintala, you initially wrote:

Actually we look into the issue.
Thanks for your confirmation, we are closing the issue as completed.

Then I wrote this:

@RaviAkshintala, so you say that you "close this issue as completed" while you "look into the issue".
Seriously? Are you kidding?

And, after my comment, you edited your previous comment and you removed the sentence "Actually we look into the issue". It is fortunate that the editing history of posts is available to demonstrate this.

Let me say that this is extremely offensive and dishonest.

As @MarkCallow and @mprather confirmed with me, the problem is NOT fixed. Not only did you close the issue without a complete fix, but you also erased the part of the discussion which showed this.

@alemuntoni

I can confirm that the problem is not solved. Please reopen this issue and solve it ASAP, at least by reverting to the old working runner.

We have been dealing with broken runners for two weeks, and this is very unprofessional.

@RaviAkshintala

Hi @lelegard, we apologise for the mistake and will look into this carefully. Thanks.

@MarkCallow

If GitHub supported it, the right thing to do would be to mark this as a duplicate of #10004, so there aren't multiple threads of discussion going on.

The description I gave earlier of the JNI failure is what remains of the original problem since the deployment of 20240610.1.0. Actually, #10055 was opened specifically regarding the JNI issue. That too, in my view, is a duplicate of #10004.

@lelegard
Author

@RaviAkshintala and all GitHub folks,

Because characterizing the problem was only possible in a GitHub Actions runner context, I had to run many workflows on a copy of a big repo to conclude that the C++ std::mutex was the issue.

Because of this problem, which was created by GitHub with a careless, insufficiently tested upgrade, I burnt all my Actions credits:

You've used 100% of included services for GitHub Actions.
To continue using Actions & Packages uninterrupted, update your spending limit.

This is the first time it happens to me in 11 years of GitHub usage.

GitHub cannot credit back the many hours of my time I lost on this issue (and many others' time as well). However, it would be fair for GitHub to restore my Actions credits. Again, this credit was lost because of a GitHub bug, not for my own usage or the usage of my project.

So, please consider recrediting my Actions quota.

@lelegard
Author

To all: 7 days after complaining that investigating GitHub's problem burnt all my GH Actions credits, and asking for a refill of the credits, I have still got nothing. My GH Actions credit is still zero, all burnt to do what GH should have done: investigate the problem that they created. And the problem is still not fixed. The contempt and disregard of GH for its users seems to have no limit.

@connorjclark

@lelegard Did you file a customer support request? Those have always been resolved to my satisfaction. A comment in this thread won't get you the help you seek.
