HLT crashes in run 388769 and 388770: InvalidReference exception involving DetSetVector::inserv called with index already in collection;
#46783
A new Issue was created by @mmusich. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Some of the error files from those runs can be found at
Below is a reproducer tested on
#!/bin/bash
# cmsrel CMSSW_14_1_5_patch2
# cd CMSSW_14_1_5_patch2/src
# cmsenv
hltLabel=hlt
hltMenu=run:388769
globalTag=141X_dataRun3_HLT_v1
hltGetConfiguration \
"${hltMenu}" \
--globaltag "${globalTag}" \
--data \
--no-prescale \
--no-output \
--max-events 1 \
--input root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run388769/run388769_ls0186_index000175_fu-c2b03-06-01_pid4137691.root \
--path HLT_HIUPC_DoubleEG5_BptxAND_SinglePixelTrack_MaxPixelTrack_v* \
> "${hltLabel}".py
cat <<@EOF >> "${hltLabel}".py
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
del process.MessageLogger
process.load('FWCore.MessageLogger.MessageLogger_cfi')
process.source.skipEvents = cms.untracked.uint32( 90 )
@EOF
cmsRun "${hltLabel}".py &> "${hltLabel}".log
|
assign hlt, heterogeneous |
@cms-sw/trk-dpg-l2 @ferencek @mroguljic FYI |
New categories assigned: hlt,heterogeneous @fwyzard,@makortel,@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks |
type trk |
The issue was briefly discussed in the Tracker Operations and Pixel Offline meetings this week. No definite conclusions were reached, but the firmware upgrade was thought to be a possible reason for the observed HLT crashes. The general consensus was that the issue needs to be better understood from the firmware side before any attempt is made to fix the problem from the offline side, assuming that's the right place to fix it. |
That certainly needs to happen
I beg to differ. Either Pixel operations guarantees this particular firmware never gets uploaded again, or Tracker DPG puts in place a protection against corrupt data. Crashing the HLT is not an option. |
We can try to implement a fix for this particular crash once we understand what really happened, but there is no guarantee that this will safeguard the HLT from other possible failure modes. But yes, crashing the HLT is certainly not an acceptable mode of operation. |
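For illustration only, the kind of protection being discussed can be sketched in plain Python. This is a toy model, not CMSSW code: `ToyDetSetCollection`, `DuplicateDetIdError` and `fill_guarded` are hypothetical names, the detector IDs are arbitrary numbers, and whether a repeated ID should be skipped, or the whole event flagged, is a decision for the experts and is not settled in this thread.

```python
from bisect import bisect_left


class DuplicateDetIdError(Exception):
    """Toy stand-in for the InvalidReference exception (hypothetical name)."""


class ToyDetSetCollection:
    """Toy collection keyed by detector ID; not the CMSSW class."""

    def __init__(self):
        self._ids = []   # detector IDs, kept sorted
        self._hits = []  # hits per detector, parallel to self._ids

    def insert(self, det_id, hits):
        # Refuse to insert an ID that is already present.
        pos = bisect_left(self._ids, det_id)
        if pos < len(self._ids) and self._ids[pos] == det_id:
            raise DuplicateDetIdError(f"index already in collection: {det_id}")
        self._ids.insert(pos, det_id)
        self._hits.insert(pos, list(hits))

    def ids(self):
        return list(self._ids)


def fill_guarded(collection, raw_blocks):
    """Insert (det_id, hits) blocks, skipping repeated IDs instead of raising.

    Returns the skipped detector IDs so the caller can report them.
    """
    seen = set(collection.ids())
    skipped = []
    for det_id, hits in raw_blocks:
        if det_id in seen:
            skipped.append(det_id)  # assumed corrupt input: same ID twice
            continue
        collection.insert(det_id, hits)
        seen.add(det_id)
    return skipped


if __name__ == "__main__":
    # Arbitrary example IDs; the first one appears twice on purpose.
    blocks = [(303042564, [1, 2]), (303042568, [3]), (303042564, [4])]
    coll = ToyDetSetCollection()
    print("skipped:", fill_guarded(coll, blocks))
    print("stored:", coll.ids())
```

The guard turns a fatal exception into a list of skipped IDs that can be logged, which is the behaviour asked for above; as noted, it would not cover other failure modes of corrupt input.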
Marino thought it might be worthwhile to repost my recent observations from Mattermost.
|
On Nov 22, 2024, during runs 388769 and 388770 (PbPb stable-beam collisions, HLT release CMSSW_14_1_5_patch2), we got hundreds of HLT crashes (509 for 388769, e-log, and 1 for 388770, e-log) involving the following exception messages:
or
The exception is reminiscent of an earlier issue documented at #39045.
From a preliminary investigation, the crashes seem to be related to a new version of the pixel firmware uploaded online on Nov 22.
The logs from F3 Mon are attached to the thread.
f3mon_logtable_2024-11-23T08_18_32.480Z.txt
f3mon_logtable_2024-11-23T08_18_18.602Z.txt
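As a rough cross-check of the crash counts, one could tally the lines of an attached log that carry the exception text. The sketch below assumes the message appears at most once per line, which may not match the actual F3 Mon layout; the signature string is taken from the issue title, and the function names are hypothetical.

```python
# Substring taken from the issue title; the log layout itself is an assumption.
SIGNATURE = "DetSetVector::inserv called with index already in collection"


def count_signature_lines(lines, signature=SIGNATURE):
    """Count the lines that contain the exception signature."""
    return sum(1 for line in lines if signature in line)


def count_in_file(path, signature=SIGNATURE):
    """Same count, reading the lines from a text file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_signature_lines(handle, signature)
```

For example, `count_in_file("f3mon_logtable_2024-11-23T08_18_32.480Z.txt")` would give the number of matching lines in the first attachment.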
Once the error stream files are made available, we'll attempt to reproduce the crashes.
Cc:
@cms-sw/hlt-l2 @cms-sw/heterogeneous-l2 @trocino @vince502