
Broken C++ runtime on windows-2022 version 20240603.1.0 #10020

Open

lelegard opened this issue Jun 8, 2024 · 26 comments

@lelegard commented Jun 8, 2024

Description

Something was broken in runner image windows-2022 version 20240603.1.0, maybe in the VC++ runtime.

When running test programs in a workflow, using a C++ mutex (std::mutex) immediately terminates the application with error 1.

As a consequence, all continuous integration pipelines and non-regression test suites for C++ applications are broken and unusable as soon as the application uses a mutex.

This is a very serious issue which should be addressed with a high priority.

The problem appeared after the upgrade of windows-2022 (aka "windows-latest"):

  • from runner version 2.316.1, image version 20240514.3.0
  • to runner version 2.317.0, image version 20240603.1.0

Platforms affected

  • [ ] Azure DevOps
  • [x] GitHub Actions - Standard Runners
  • [ ] GitHub Actions - Larger Runners

Runner images affected

  • [ ] Ubuntu 20.04
  • [ ] Ubuntu 22.04
  • [ ] Ubuntu 24.04
  • [ ] macOS 11
  • [ ] macOS 12
  • [ ] macOS 13
  • [ ] macOS 13 Arm64
  • [ ] macOS 14
  • [ ] macOS 14 Arm64
  • [ ] Windows Server 2019
  • [x] Windows Server 2022

Image version and build link

The problem is demonstrated in the following simple repo: https://github.com/lelegard/gh-runner-lock

The log with the sample failure: https://github.com/lelegard/gh-runner-lock/actions/runs/9431982526/job/25981214253

Is it regression?

Last worked on windows-2022 runner version 2.316.1, image version 20240514.3.0

Expected behavior

The C++ applications which are built as part of a workflow should not crash during the subsequent test phase.

Actual behavior

See repro section.

Repro steps

The problem is demonstrated in the following simple repo: https://github.com/lelegard/gh-runner-lock

The log with the sample failure: https://github.com/lelegard/gh-runner-lock/actions/runs/9431982526/job/25981214253

The C++ program is quite simple:

#include <mutex>
#include <iostream>
#include <cstdlib> // for EXIT_SUCCESS
int main()
{
    std::cout << "1" << std::flush << std::endl;
    std::mutex m;
    std::cout << "2" << std::flush << std::endl;
    std::lock_guard<std::mutex> lock(m);
    std::cout << "3" << std::flush << std::endl;
    return EXIT_SUCCESS;
}

Of course, being so simple, this program works well everywhere, including on local Windows development systems.

When executed in a GitHub workflow, starting with Windows runner version 2.317.0, it fails in the lock step. The above-mentioned log contains this:

1
2
Error: Process completed with exit code 1.

Background

The problem initially appeared after that upgrade on the TSDuck project, where all workflows suddenly failed on Windows platforms.

The project is quite big (~ 350,000 lines of C++). Everything works fine on local Windows development systems; only the GitHub CI failed. Because of the size of the project and the absence of direct interaction with the GitHub runner, identifying the reason for the failure was quite hard. I spent hours on test repos and ran 52 versions of the CI workflow to understand the nature of the problem.

@grafikrobot

We worked around the problem by switching to the static MSVC runtime (bfgroup/b2@0075644). We plan to stay on the static runtime to avoid hitting this problem again and again, as it's not the first time such botched updates have happened.
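A minimal sketch of enforcing that choice at compile time (relying only on MSVC's documented _DLL predefined macro, which is set for /MD and /MDd builds; the switch itself lives in the build system, e.g. b2's runtime-link=static feature):

// Sketch: fail the build unless the static MSVC runtime was selected.
// MSVC predefines _DLL only when compiling against the shared runtime
// (/MD or /MDd), so its absence confirms /MT or /MTd.
#ifdef _DLL
#error "Shared MSVC runtime (/MD) selected; rebuild with /MT to embed the runtime."
#endif

int main() { return 0; }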

@rzblue

rzblue commented Jun 9, 2024

Looks like the same as #10004; there are some workarounds in that issue.

@lelegard
Author

lelegard commented Jun 9, 2024

@grafikrobot

We worked around the problem by switching to the static MSVC runtime

Glad it works for you. However, I have many DLLs and executables in the project. We can't link with the static runtime: if we did, each process would end up with as many instances of the runtime as DLLs (+1 for the .exe), which does not work.

@rzblue

Looks like the same as #10004; there are some workarounds in that issue.

Thanks for pointing this out. That issue is very recent too; it was opened while I was chasing the reason for the problem (it took a long time to identify the mutex issue).

@mmomtchev

Glad it works for you. However, I have many DLLs and executables in the project. We can't link with the static runtime: if we did, each process would end up with as many instances of the runtime as DLLs (+1 for the .exe), which does not work.

In fact, unless you have total control over your users' machines, you should be grateful to GitHub for showing you the problem. The expression "Windows DLL hell" exists for a reason, and the only viable solution when shipping binaries for Windows is to either go static or ship this DLL yourself. As for me, this is the last time I get burned by the MSVC runtime.

@paulhiggs

There have definitely been some changes in the 'default construction' of a mutex. Extracts from various recent versions...

14.32.31326

    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }

14.39.35519

class _Mutex_base { // base class for all mutex types
public:
#ifdef _ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR
    constexpr _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_storage._Critical_section = {};
        _Mtx_storage._Thread_id        = -1;
        _Mtx_storage._Type             = _Flags | _Mtx_try;
        _Mtx_storage._Count            = 0;
    }
#else // ^^^ _ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR / !_ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR vvv
    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }
#endif // ^^^ !_ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR ^^^

14.40.33807 - this seems to be the latest

#ifdef _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR
    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }
#else // ^^^ defined(_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR) / !defined(_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR) vvv
    constexpr _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_storage._Critical_section = {};
        _Mtx_storage._Thread_id        = -1;
        _Mtx_storage._Type             = _Flags | _Mtx_try;
        _Mtx_storage._Count            = 0;
    }
#endif // ^^^ !defined(_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR) ^^^

In 14.32 and 14.39, the default behaviour (when no compiler directive is specified) is

    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }

Not providing the _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR directive when using 14.40 gives you a different (constexpr) constructor.
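A minimal sketch of the workaround (in a real project the macro would normally be passed as a project-wide preprocessor definition on the compiler command line, so that every translation unit agrees; defining it in source is shown here only for brevity):

// Workaround sketch: opt out of the constexpr mutex constructor so binaries
// built with the 14.40 headers stay compatible with an older msvcp140.dll.
// The macro must be defined before any standard header is included.
#define _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR
#include <mutex>

int main() {
    std::mutex m;                        // back to _Mtx_init_in_situ()
    std::lock_guard<std::mutex> lock(m); // no longer aborts on the old runtime
    return 0;
}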

@lelegard
Author

lelegard commented Jun 9, 2024

@mmomtchev

It's worse than that. I tried to install the same VC runtime as used during the build and the problem remains the same; see #10004 (comment).

@mmomtchev

This is normal: if you build with this version of MSVC, you need the new runtime.

@lelegard
Author

lelegard commented Jun 9, 2024

@mmomtchev,

This is normal, if you build with this version of MSVC, you need the new runtime.

Precisely: I explicitly install on the runner system the MSVC runtime with which I built the code, so the same runtime is used at build time and at run time. But it does not work; running the application still fails on locking the mutex.

There is an inconsistency somewhere. Either the VCRedist package in the VS tree is not the same as the one used by the compiler, or an already installed MSVC runtime takes precedence because of PATH settings. So this is either an inconsistency in VS or an inconsistency in the GH runner.

@mmomtchev

MSVC does not build with the runtime. MSVC produces code that expects to find its runtime. It would be the same with gcc on Linux: install an older shared library runtime, build with a new compiler that uses something that does not exist in this older runtime, and it won't work. The difference is that on Linux these days you will get an error saying the dynamic linker cannot find its symbols; on Windows you get a crash. Microsoft should definitely find a solution to this problem. The Linux solution was also retrofitted much later than the original design.

@lelegard
Author

lelegard commented Jun 9, 2024

@mmomtchev

MSVC does not build with the runtime. MSVC produces code that expects to find its runtime.

You misinterpreted what I meant. The compiler and the RTL work together, always. The compiler (well, the compilation environment at large) provides header files. These header files contain specific definitions, here the variants of the mutex constructor. The binary of the runtime must be compatible with these definitions. If a new version of the compiler (and compilation environment) introduces a new version of a constructor, the corresponding code must be in the RTL. Therefore, there is a new version of the RTL which comes (somehow) with the compiler.

When you install Visual Studio, the installed tree of files contains a VCRedist package, a package to install on target systems to make sure that they will be able to run applications which are compiled by this compiler.

This is what I mean: when you build with a given version of Visual Studio, the headers which are used during the compilation must be compatible with the provided VCRedist package. This is why, when you package an application, you typically include this VCRedist in the package of your application and install both at the same time. Thus, you can be confident that your application will work on the target system, even if it is a bit older (up to a point).

This is why, in my test, I explicitly installed the VCRedist that is found in the Visual Studio setup of the GH runner, before running the application (before compiling in fact). The expected result is that the application which is built is compatible with the RTL that we just installed.

And this is what fails... Therefore, something is rotten in the state of GitHub.
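A minimal diagnostic sketch (relying only on MSVC's documented predefined macros) that prints the toolset and STL versions a binary was compiled against, to compare with the msvcp140.dll actually present on the runner:

// Diagnostic sketch: print the versions this binary was built against.
// _MSC_VER 1940 corresponds to toolset 14.40; _MSVC_STL_UPDATE is the
// STL release in YYYYMM form.
#include <iostream>

int main() {
#ifdef _MSC_VER
    std::cout << "_MSC_VER:         " << _MSC_VER << "\n";
#endif
#ifdef _MSVC_STL_UPDATE
    std::cout << "_MSVC_STL_UPDATE: " << _MSVC_STL_UPDATE << "\n";
#endif
    return 0;
}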

@mmomtchev

You misinterpreted what I meant.

Yes, indeed. I agree: they updated the compiler without updating the VCRedist package. However, this is also a huge wake-up call for everyone who ships Windows binaries. In my case, it is a Node.js addon that is installed through npm and does not have an installer. In those cases, the only viable solution is /MT.

@RaviAkshintala

@lelegard We are looking into the issue and will get back to you after investigating.

sgilmore10 added a commit to apache/arrow that referenced this issue Jun 12, 2024
… crash on `windows-2022` after MSVC update from 14.39.33519 to 14.40.33807 (#42123)

### Rationale for this change

After the `windows-2022` GitHub runner image was updated last week, MATLAB began crashing when running the unit tests in `arrow/matlab/test/tfeather.m` on Windows. As part of the update, VS 2022 was updated from 
`17.9.34902.65` to `17.10.34928.147` and MSVC was updated from `14.39.33519` to `14.40.33807`. 

It looks like many other projects have run into this issue as well:

1. actions/runner-images#10004
2. actions/runner-images#10020

The suggested workaround for this crash is to supply the flag `_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR` when building.

### What changes are included in this PR?

1. Supply the `_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR` flag when building Arrow C++.

### Are these changes tested?

N/A. Existing tests used.

### Are there any user-facing changes?

No.
* GitHub Issue: #42015

Authored-by: Sarah Gilmore <[email protected]>
Signed-off-by: Sarah Gilmore <[email protected]>
@RaviAkshintala removed the awaiting-deployment label (Code complete; awaiting deployment and/or deployment in progress) Jun 13, 2024
@RaviAkshintala

@lelegard

The deployment has completed; could you please try to rerun the workflow?

Image: windows-2022
Version: 20240610.1.0
Included Software: https://github.com/actions/runner-images/blob/win22/20240610.1/images/windows/Windows2022-Readme.md
Image Release: https://github.com/actions/runner-images/releases/tag/win22%2F20240610.1

If you have any issues, please reach out to us.

lelegard added a commit to tsduck/tsduck that referenced this issue Jun 14, 2024
The problem started with GitHub runner version 2.317.0, image version 20240603.1.0.
Said to be fixed now.
See actions/runner-images#10020
@lelegard
Author

@RaviAkshintala

Thanks for the update. It works "a bit better" but all JNI applications (Java Native Interface) are still crashing. So, no, the updated runner is not acceptable.

Let me explain: in my project, I have C++ DLLs and C++ executables. There are also Python and Java bindings. All C++ applications now work correctly. Same for Python applications (the Python interpreter successfully loads my DLL when a Python application calls my Python bindings).

However, when a Java application calls the Java bindings, loading the DLL fails with this error:

Exception in thread "main" java.lang.UnsatisfiedLinkError: D:\a\tsduck\tsduck\bin\Release-x64\tsduck.dll:
A dynamic link library (DLL) initialization routine failed

Log here: https://github.com/tsduck/tsduck/actions/runs/9518023948/job/26237929549

That is exactly the same symptom as with the previous update: the initialization routines of the DLL load resources using non-thread-safe system functions and therefore use a std::mutex. That is what fails for everyone.
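A hypothetical sketch of that pattern (the structure and names are invented for illustration, not taken from TSDuck):

// Hypothetical sketch: a DLL whose static initialization serializes access
// to non-thread-safe system functions with a std::mutex. Built with the
// 14.40 headers but run against an older msvcp140.dll (here, one loaded by
// the JVM), the lock aborts and the DLL initialization routine fails.
#include <mutex>

namespace {
    std::mutex g_init_mutex; // constexpr-constructed with 14.40 headers

    struct Registry {
        Registry() {
            std::lock_guard<std::mutex> lock(g_init_mutex); // aborts on the old runtime
            // ... call non-thread-safe system functions to load resources ...
        }
    };

    Registry g_registry; // runs during DLL initialization
}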

Note that the crash occurs only with 64-bit applications. The 32-bit version works with Java (probably not the same mixture of VC runtime DLLs).

I re-enabled the workaround I implemented earlier, defining _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR, and everything works again, Java / JNI applications included.

So, please consider adding JNI test cases in your validation suites.

@MarkCallow

MarkCallow commented Jun 15, 2024

The expression "Windows DLL hell" exists for a reason, and the only viable solution when shipping binaries for Windows is to either go static or ship this DLL yourself. As for me, this is the last time I get burned by the MSVC runtime.

This is not foolproof, and it is the reason for the continuing Java crashes. JNI modules are built with the latest VC++ and need the latest runtime, but Java Temurin contains its own older version of the vcruntime, which is loaded by the JVM. When the JVM loads a JNI module, it links the module with the vcruntime it has already loaded. When the module attempts to create a mutex, it calls the code in the older vcruntime and the JVM crashes. The workaround is to remove the vcruntime from the Temurin installation.
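A minimal Windows-only diagnostic sketch (standard Win32 calls; the bundled-copy scenario above is what it is meant to reveal) that prints where msvcp140.dll, the C++ runtime DLL containing the std::mutex machinery, was actually loaded from:

// Diagnostic sketch (Windows-only): print the path of the msvcp140.dll that
// this process actually loaded. If it points into the JVM's directory rather
// than System32, a bundled copy is shadowing the system runtime.
#include <windows.h>
#include <iostream>

int main() {
    HMODULE mod = GetModuleHandleW(L"msvcp140.dll");
    if (mod == nullptr) {
        std::cout << "msvcp140.dll not loaded (static runtime build?)\n";
        return 0;
    }
    wchar_t path[MAX_PATH];
    if (GetModuleFileNameW(mod, path, MAX_PATH) != 0) {
        std::wcout << L"msvcp140.dll loaded from: " << path << L"\n";
    }
    return 0;
}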

@RaviAkshintala

RaviAkshintala commented Jun 19, 2024

@lelegard

So, please consider adding JNI test cases in your validation suites.

Thanks for your confirmation, we are closing the issue as completed.

@lelegard
Author

@RaviAkshintala, so you say that you "close this issue as completed" while you "look into the issue".
Seriously? Are you kidding?

@MarkCallow

Thanks for your confirmation, we are closing the issue as completed.

All we've confirmed is that the runner image is still broken. I am therefore in total agreement with @lelegard.

@RaviAkshintala, so you say that you "close this issue as completed" while you "look into the issue".
Seriously? Are you kidding?

@mprather

@lelegard

Thanks for your confirmation, we are closing the issue as completed.

@RaviAkshintala, I don't understand. The confirmation indicated that the updated image is still unreliable and does not offer a stable, reliable build platform. It was clearly not a confirmation that everything is working once again. Why is this closed?

@lelegard
Author

@RaviAkshintala, you initially wrote:

Actually we look into the issue.
Thanks for your confirmation, we are closing the issue as completed.

Then I wrote this:

@RaviAkshintala, so you say that you "close this issue as completed" while you "look into the issue".
Seriously? Are you kidding?

And, after my comment, you edited your previous comment and you removed the sentence "Actually we look into the issue". It is fortunate that the editing history of posts is available to demonstrate this.

Let me say that this is extremely offensive and dishonest.

As @MarkCallow and @mprather confirmed with me, the problem is NOT fixed. Not only did you close the issue without a complete fix, but you also erased the part of the discussion which showed this.

@alemuntoni

I can confirm that the problem is not solved. Please reopen this issue and solve it ASAP, at least by reverting to the old working runner.

We have been dealing with broken runners for two weeks, and this is very unprofessional.

@RaviAkshintala

Hi @lelegard, we apologise for the mistake and will look into this carefully. Thanks.

@MarkCallow

If GitHub supported it, the right thing to do would be to mark this as a duplicate of #10004, so there aren't multiple threads of discussion going on.

The description I gave earlier of the JNI failure is what remains of the original problem since the deployment of 20240610.1.0. Actually, #10055 was opened specifically regarding the JNI issue. That too, in my view, is a duplicate of #10004.

@lelegard
Author

@RaviAkshintala and all GitHub folks,

Because characterizing the problem was only possible in a GitHub Actions runner context, I had to run many workflows on a copy of a big repo to conclude that the C++ std::mutex was the issue.

Because of this problem, which was created by GitHub with a careless, insufficiently tested upgrade, I burnt all my Actions credits:

You've used 100% of included services for GitHub Actions.
To continue using Actions & Packages uninterrupted, update your spending limit.

This is the first time it happens to me in 11 years of GitHub usage.

GitHub cannot credit back the many hours of my time I lost on this issue (and many others' time as well). However, it would be fair for GitHub to restore my Actions credits. Again, this credit was lost because of a GitHub bug, not for my own usage or the usage of my project.

So, please consider recrediting my Actions quota.

@lelegard
Author

To all: 7 days after complaining that investigating GitHub's problem burnt all my GH Actions credits, and asking for a refill of the credits, I have still got nothing. My GH Actions credit is still zero, all burnt to do what GH should have done: investigate the problem that they created. And the problem is still not fixed. The contempt and disregard of GH for its users seems to have no limit.

@connorjclark

@lelegard Did you file a customer support request? Those have always been resolved to my satisfaction. A comment in this thread won't get you the help you seek.
