Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in System.Collections.Concurrent test #112351

Open
MichalStrehovsky opened this issue Feb 10, 2025 · 8 comments
Open

Crash in System.Collections.Concurrent test #112351

MichalStrehovsky opened this issue Feb 10, 2025 · 8 comments
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Milestone

Comments

@MichalStrehovsky
Copy link
Member

Seen on ARM64 Linux at https://dev.azure.com/dnceng-public/public/_build/results?buildId=945755&view=logs&j=2f6a7d26-0d60-5ade-d191-981fe0847989&t=0d559cd2-9dc1-5e64-75a8-1eaf78bd722a

At the time of filing, the dump is available with:

runfo get-helix-payload -j bbbf4d04-8cc5-4acc-8c75-a79e1dff6ccb -w System.Collections.Concurrent.Tests -o c:\helix\bbbf4d04-8cc5-4acc-8c75-a79e1dff6ccb\System.Collections.Concurrent.Tests\
(lldb) bt
* thread #1, name = 'System.Collecti', stop reason = signal SIGSEGV: address not mapped to object
  * frame #0: 0x0000c161fb8d708c System.Collections.Concurrent.Tests`MethodTable::Validate(bool) [inlined] MethodTable::HasComponentSize(this=0x0000007200000000) at MethodTable.h:228:25 [opt]
    frame #1: 0x0000c161fb8d708c System.Collections.Concurrent.Tests`MethodTable::Validate(this=0x0000007200000000, assertOnFail=<unavailable>) at MethodTable.cpp:29:9 [opt]
    frame #2: 0x0000c161fb8a3d80 System.Collections.Concurrent.Tests`WKS::CObjectHeader::Validate(int, int, int) [inlined] MethodTable::SanityCheck(this=<unavailable>) at MethodTable.h:296:40 [opt]
    frame #3: 0x0000c161fb8a3d7c System.Collections.Concurrent.Tests`WKS::CObjectHeader::Validate(this=0x0000f27a1f343a64, bDeep=NO, bVerifyNextHeader=<unavailable>, bVerifySyncBlock=<unavailable>) at gc.cpp:4778:9 [opt]
    frame #4: 0x0000c161fb8bc034 System.Collections.Concurrent.Tests`WKS::GCHeap::Relocate(ppObject=0x0000f2796560d4d8, sc=<unavailable>, flags=0) at gc.cpp:50011:39 [opt]
    frame #5: 0x0000c161fb8e6e60 System.Collections.Concurrent.Tests`GcInfoDecoder::EnumerateLiveSlots(this=0x0000f2797a20bc48, pRD=<unavailable>, reportScratchSlots=<unavailable>, inputFlags=1, pCallBack=<unavailable>, hCallBack=<unavailable>) at gcinfodecoder.cpp:1016:21 [opt]
    frame #6: 0x0000c161fb8e9820 System.Collections.Concurrent.Tests`UnixNativeCodeManager::EnumGcRefs(this=<unavailable>, pMethodInfo=0x0000f2797a20c198, safePointAddress=<unavailable>, pRegisterSet=<unavailable>, hCallback=0x0000f2797a20bd00, isActiveStackFrame=true) at UnixNativeCodeManager.cpp:237:18 [opt]
    frame #7: 0x0000c161fb88ea1c System.Collections.Concurrent.Tests`EnumGcRefs(pCodeManager=<unavailable>, pMethodInfo=<unavailable>, safePointAddress=<unavailable>, pRegisterSet=<unavailable>, pfnEnumCallback=(System.Collections.Concurrent.Tests`WKS::GCHeap::Relocate(Object**, ScanContext*, unsigned int) at gc.cpp:49983), pvCallbackData=0x0000f2797a20c330, isActiveStackFrame=<unavailable>) at GcEnum.cpp:139:19 [opt]
    frame #8: 0x0000c161fb895098 System.Collections.Concurrent.Tests`Thread::GcScanRootsWorker(this=0x0000f2796560f7e8, pfnEnumCallback=(System.Collections.Concurrent.Tests`WKS::GCHeap::Relocate(Object**, ScanContext*, unsigned int) at gc.cpp:49983), pvCallbackData=0x0000f2797a20c330, frameIterator=0x0000f2797a20c028) at thread.cpp:519:13 [opt]
    frame #9: 0x0000c161fb894da0 System.Collections.Concurrent.Tests`Thread::GcScanRoots(this=0x0000f2796560f7e8, pfnEnumCallback=(System.Collections.Concurrent.Tests`WKS::GCHeap::Relocate(Object**, ScanContext*, unsigned int) at gc.cpp:49983), pvCallbackData=0x0000f2797a20c330) at thread.cpp:419:5 [opt]
    frame #10: 0x0000c161fb88d750 System.Collections.Concurrent.Tests`GCToEEInterface::GcScanRoots(fn=(System.Collections.Concurrent.Tests`WKS::GCHeap::Relocate(Object**, ScanContext*, unsigned int) at gc.cpp:49983), condemned=<unavailable>, max_gen=<unavailable>, sc=0x0000f2797a20c330) at gcenv.ee.cpp:122:22 [opt]
    frame #11: 0x0000c161fb8c49a0 System.Collections.Concurrent.Tests`WKS::gc_heap::relocate_phase(condemned_gen_number=0, first_condemned_address=<unavailable>) at gc.cpp:37248:5 [opt]
    frame #12: 0x0000c161fb8b3b8c System.Collections.Concurrent.Tests`WKS::gc_heap::plan_phase(condemned_gen_number=0) at gc.cpp:34543:9 [opt]
    frame #13: 0x0000c161fb8ac848 System.Collections.Concurrent.Tests`WKS::gc_heap::gc1() at gc.cpp:22699:13 [opt]
    frame #14: 0x0000c161fb8b7750 System.Collections.Concurrent.Tests`WKS::gc_heap::garbage_collect(n=0) at gc.cpp:0:21 [opt]
    frame #15: 0x0000c161fb8a7a78 System.Collections.Concurrent.Tests`WKS::GCHeap::GarbageCollectGeneration(this=<unavailable>, gen=0, reason=reason_alloc_soh) at gc.cpp:51499:9 [opt]
    frame #16: 0x0000c161fb8a8d60 System.Collections.Concurrent.Tests`WKS::gc_heap::trigger_gc_for_alloc(gen_number=0, gr=reason_alloc_soh, msl=<unavailable>, loh_p=<unavailable>, take_state=<unavailable>) at gc.cpp:19191:14 [opt]
    frame #17: 0x0000c161fb8a9af8 System.Collections.Concurrent.Tests`WKS::gc_heap::try_allocate_more_space(acontext=0x0000f2797a20f7f0, size=24, flags=0, gen_number=0) at gc.cpp:19329:34 [opt]
    frame #18: 0x0000c161fb8d1cd4 System.Collections.Concurrent.Tests`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) [inlined] WKS::gc_heap::allocate_more_space(acontext=0x0000f2797a20f7f0, size=24, flags=0, alloc_generation_number=0) at gc.cpp:19829:18 [opt]
    frame #19: 0x0000c161fb8d1cc0 System.Collections.Concurrent.Tests`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) [inlined] WKS::gc_heap::allocate(jsize=24, acontext=0x0000f2797a20f7f0, flags=0) at gc.cpp:19860:19 [opt]
    frame #20: 0x0000c161fb8d1c98 System.Collections.Concurrent.Tests`WKS::GCHeap::Alloc(this=0x0000c1621b431d60, context=0x0000f2797a20f7f0, size=24, flags=0) at gc.cpp:50404:34 [opt]
    frame #21: 0x0000c161fb88c1c0 System.Collections.Concurrent.Tests`GcAllocInternal(pEEType=0x0000c161fc8f7550, uFlags=0, numElements=0, pThread=0x0000f2797a20f7e8) at GCHelpers.cpp:606:53 [opt]
    frame #22: 0x0000c161fb8ef09c System.Collections.Concurrent.Tests`RhpNewObject at AllocFast.S:88
    frame #23: 0x0000c161fc4ddf14 System.Collections.Concurrent.Tests`xunit_assert_Xunit_Assert__Equal_11<Int32>(expected=<unavailable>, actual=<unavailable>, comparer=0x0000f27a20bfe668) at EqualityAsserts.cs:152
    frame #24: 0x0000c161fb9fa7cc System.Collections.Concurrent.Tests`System_Collections_Concurrent_Tests_System_Collections_Concurrent_Tests_ConcurrentDictionaryTests__TestUpdate(cLevel=5, threads=5, updatesPerThread=<unavailable>) at ConcurrentDictionaryTests.cs:230
    frame #25: 0x0000c161fc605d4c System.Collections.Concurrent.Tests`Internal_CompilerGenerated__Module___<DynamicInvoke>Static<S_P_CoreLib_System_Void__Int32__Int32__Int32> + 92
    frame #26: 0x0000c161fbc9b480 System.Collections.Concurrent.Tests`S_P_CoreLib_System_Reflection_DynamicInvokeInfo__InvokeWithFewArguments(this=0x0000f27a1e12d090, methodToCall=<unavailable>, thisArg=<unavailable>, ret=<unavailable>, parameters=0x0000f27a1e12c440, binderBundle=<unavailable>, wrapInTargetInvocationException=true) at DynamicInvokeInfo.cs:504
    frame #27: 0x0000c161fbc9af30 System.Collections.Concurrent.Tests`S_P_CoreLib_System_Reflection_DynamicInvokeInfo__Invoke(this=0x0000f27a1e12d090, thisPtr=<unavailable>, methodToCall=<unavailable>, parameters=<unavailable>, binderBundle=<unavailable>, wrapInTargetInvocationException=true) at DynamicInvokeInfo.cs:245
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Feb 10, 2025
@jkotas
Copy link
Member

jkotas commented Feb 10, 2025

Looks like bad GCInfo for Monitor::Exit:

System_Collections_Concurrent!S_P_CoreLib_System_Threading_Monitor::Exit:
0000c161`fbc31a60 a9bd7bfd stp         fp,lr,[sp,#-0x30]!
0000c161`fbc31a64 a901d3f3 stp         x19,x20,[sp,#0x18]
0000c161`fbc31a68 f90017f5 str         x21,[sp,#0x28]
0000c161`fbc31a6c 910003fd mov         fp,sp
0000c161`fbc31a70 f9000bbf str         xzr,[fp,#0x10]
0000c161`fbc31a74 aa0003f3 mov         x19,x0
0000c161`fbc31a78 b4000e33 cbz         x19,System_Collections_Concurrent!S_P_CoreLib_System_Threading_Monitor::Exit+0x1dc (0000c161`fbc31c3c)
0000c161`fbc31a7c d53bd041 mrs         x1,TPIDR_EL0
0000c161`fbc31a80 d2a00000 movz        x0,#0,lsl #0x10
0000c161`fbc31a84 f2800200 movk        x0,#0x10
0000c161`fbc31a88 d503201f nop
0000c161`fbc31a8c d503201f nop
0000c161`fbc31a90 8b000020 add         x0,x1,x0 <------------------- GC triggered here
0000c161`fbc31a94 f9400014 ldr         x20,[x0]
0000c161`fbc31a98 b4000db4 cbz         x20,System_Collections_Concurrent!S_P_CoreLib_System_Threading_Monitor::Exit+0x1ec (0000c161`fbc31c4c)
0000c161`fbc31a9c b9408a95 ldr         w21,[x20,#0x88]
0000c161`fbc31aa0 510006a0 sub         w0,w21,#1
0000c161`fbc31aa4 2a807eb5 orr         w21,w21,w0,asr #0x1F
0000c161`fbc31aa8 f9000bb3 str         x19,[fp,#0x10]
0000c161`fbc31aac f9400ba0 ldr         x0,[fp,#0x10]
0000c161`fbc31ab0 d1001000 sub         x0,x0,#4
0000c161`fbc31ab4 b9400001 ldr         w1,[x0]
0000c161`fbc31ab8 53003c22 uxth        w2,w1
0000c161`fbc31abc 6b15005f cmp         w2,w21
0000c161`fbc31ac0 54000281 bne         System_Collections_Concurrent!S_P_CoreLib_System_Threading_Monitor::Exit+0xb0 (0000c161`fbc31b10)

We are reporting r20 to the GC, but r20 has not been initialized yet.

@jkotas
Copy link
Member

jkotas commented Feb 10, 2025

JitDump of Monitor.Exit: jitdump.txt

@jkotas
Copy link
Member

jkotas commented Feb 10, 2025

Relevant GCInfo fragment from jitdump:

...
IN0006: 000028      blr     x0
                             ; gcrRegs -[x19] +[x20]
IN0007: 00002C      add     x0, x1, x0
IN0008: 000030      ldr     x20, [x0]
IN0009: 000034      cbz     x20, G_M47640_IG17
...

The GCInfo seems to be marking r20 as live one instruction too early. @dotnet/jit-contrib Could you please take a look?

@jkotas jkotas added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-reliability Reliability/stability related issue (stress, load problems, etc.) and removed area-NativeAOT-coreclr labels Feb 10, 2025
@jkotas jkotas added this to the 10.0.0 milestone Feb 10, 2025
@EgorBo
Copy link
Member

EgorBo commented Feb 11, 2025

Looks like it's TLS access expansion cc @kunalspathak

the add instruction seems to be emitted here:

GetEmitter()->emitIns_R_R_R(INS_add, EA_8BYTE, REG_R0, REG_R1, REG_R0);

@kunalspathak kunalspathak self-assigned this Feb 11, 2025
@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Feb 11, 2025
@kunalspathak
Copy link
Member

The problem here is that the live GC regs mask is encoded in the reg1 field of the instrDesc, but we overwrite it with the register value to be used for the call, thinking that it was unused. That changes the mask and we end up treating the set register value as the gc-refs mask. I am little surprised why this problem surfaced now, given that it is been there for a year now. I will do some RCA and submit a fix tomorrow.

@EgorBo
Copy link
Member

EgorBo commented Feb 11, 2025

I am little surprised why this problem surfaced now, given that it is been there for a year now. I

The change is NativeAOT only and the GC presumably has to stop precisely on that instruction in a method with a TLS access and the method has to be fully-interruptible, I guess the chances are once-a-year 😆 (perhaps, some recent C# change made that method fully-interruptible only recently?). I guess this could be a motivation to enable GCStress for NAOT

@jkotas
Copy link
Member

jkotas commented Feb 11, 2025

The change is NativeAOT only and the GC presumably has to stop precisely on that instruction in a method with a TLS access and the method has to be fully-interruptible

And we have to be lucky to get a crash dump, and an issue needs to be opened on the crash, and somebody must investigate it before the crash dump is deleted....

@MichalStrehovsky
Copy link
Member Author

We had a similar TLS issue found by the same test, probably in the same method at #102140 last year.

This one was probably rare enough that it never bubbled up visible enough. Since this is always the same test, nothing probably stresses TLS as much as what this test does.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Projects
Status: No status
Development

No branches or pull requests

4 participants