[ARM] Segfault in mkfit::MkBuilder::findTracksCloneEngine #42071
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign reconstruction
New categories assigned: reconstruction @mandrenguyen, @clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks
FYI @cms-sw/tracking-pog-l2 (I'm mostly documenting the stack trace)
type tracking
@makortel
A nested tbb parallel_for is usually just overhead, as there are no free threads in a standard CMSSW job.
There is
Maybe try running a tracking-only workflow (perhaps just one iteration) with one stream and 8 (or more) threads, if cores are available.
One gets 10 cores on lxplus8-arm (it's a (partition of a) Neoverse N1).
Probably won't reproduce. The Cavium cores at the openlab seem to be much more aggressive at taking advantage of the relaxed ARM memory model, uncovering bugs that can't be reproduced on other ARM implementations where the memory model implementation isn't as relaxed as the spec allows (I've tried on an Apple M1).
Hmmh, the actual crash seems to be in
There is a to-do note in MkFitProducer where pools of shared helper objects are populated: what's the priority on this?
We had updated oneTBB to 2021.9.0 a few IBs before, and there was another crash elsewhere in TBB code (#42093, on x86).
As an isolated crash on ARM this is not particularly urgent (although we just observed another occurrence). I'm honestly now suspecting the TBB update, but would like to see a bit more evidence before rolling it back.
We are going to roll back TBB to 2021.8.0 for 13_2_0_pre3, and I also opened an issue, uxlfoundation/oneTBB#1139. Let's see if this reproduces after the rollback, or after we re-update to 2021.9.0 in 13_3_X.
Just for the record on the related crashes, although probably not needed anymore after the revert: there is a new occurrence in workflow 136.874 (step 2) in CMSSW_13_2_X_2023-06-26-2300 on el8_aarch64_gcc11.
Another occurrence in CMSSW_13_2_X_2023-06-28-2300 on el8_aarch64_gcc11.
A similar crash occurred in CMSSW_13_3_X_2023-08-08-2300 on el8_aarch64_gcc11, workflow 136.804 step 3. This time there is also a printout from TCMalloc.
@makortel, we are using a couple-of-years-old version of tcmalloc (https://github.com/gperftools/gperftools/tree/gperftools-2.9.1), which is provided via gperftools. There is a newer version, https://github.com/gperftools/gperftools/tree/gperftools-2.10.80, available (testing via cms-sw/cmsdist#8635).
Ah, the newer version of gperftools still contains the two-years-old tcmalloc (https://github.com/gperftools/gperftools/tree/gperftools-2.10.80/src/google).
The last crash's stack backtrace is different; one of the last points is
I don't think there's anything exotic (like concurrency-related) in RecoTracker/MkFitCore/src/MkBuilder.cc, line 977 (in cmssw at 2e956e4).
Another one in CMSSW_13_3_X_2023-09-22-2300 on el8_aarch64_gcc11.
One could try to replace the TBB container with a simpler (maybe less flexible) one.
Hello, RelVal
Sorry, how is this similar?
Workflow 136.813 step 3 segfaulted in CMSSW_13_2_X_2023-06-22-2300 on el8_aarch64_gcc11 with
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_aarch64_gcc11/CMSSW_13_2_X_2023-06-22-2300/pyRelValMatrixLogs/run/136.813_RunZeroBias2017D/step3_RunZeroBias2017D.log#/