HLT crashes in run 388769 and 388770: InvalidReference exception involving DetSetVector::inserv called with index already in collection; #46783

Open
mmusich opened this issue Nov 23, 2024 · 11 comments

Comments

@mmusich

mmusich commented Nov 23, 2024

On Nov 22, 2024, during runs 388769 and 388770 (PbPb stable-beam collisions, HLT release CMSSW_14_1_5_patch2), we got hundreds of HLT crashes (509 in run 388769 and 1 in run 388770, according to the respective e-logs) involving the following exception messages:

An exception of category 'InvalidReference' occurred while
   [0] Processing  Event run: 388769 lumi: 2 event: 708614 stream: 14
   [1] Running path 'HLT_HIUPC_DoubleEG5_BptxAND_SinglePixelTrack_MaxPixelTrack_v15'
   [2] Calling method for module SiPixelDigisClustersFromSoAAlpakaHIonPhase1/'hltSiPixelClustersPPOnAA'
Exception Message:
DetSetVector::inserv called with index already in collection;
index value: 303079452

or

An exception of category 'InvalidReference' occurred while
   [0] Processing  Event run: 388770 lumi: 94 event: 102837548 stream: 16
   [1] Running path 'DQM_PixelReconstruction_v11'
   [2] Calling method for module SiPixelDigisClustersFromSoAAlpakaPhase1/'hltSiPixelClusters'
Exception Message:
DetSetVector::inserv called with index already in collection;
index value: 353118212

The exception is reminiscent of an earlier issue documented at #39045.
From a preliminary investigation, the crashes seem to be related to a new version of the pixel firmware uploaded online on Nov 22.
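As a quick cross-check, the two index values in the messages above can be unpacked as detector IDs. The sketch below assumes the usual CMSSW DetId bit layout (detector in bits 31-28, subdetector in bits 27-25) and the Tracker numbering (Tracker = 1, PixelBarrel = 1, PixelEndcap = 2); the function and table names are made up for illustration and are not part of CMSSW.

```python
# Assumed DetId layout: detector in bits 31-28, subdetector in bits 27-25.
DET_OFFSET, DET_MASK = 28, 0xF
SUBDET_OFFSET, SUBDET_MASK = 25, 0x7

# Assumed numbering; only the entries needed here are listed.
DETECTORS = {1: "Tracker"}
TRACKER_SUBDETS = {1: "PixelBarrel", 2: "PixelEndcap"}


def decode_det_id(raw_id):
    """Return (detector name, subdetector name) for a raw 32-bit detector ID."""
    det = (raw_id >> DET_OFFSET) & DET_MASK
    subdet = (raw_id >> SUBDET_OFFSET) & SUBDET_MASK
    det_name = DETECTORS.get(det, "det=%d" % det)
    if det == 1:
        subdet_name = TRACKER_SUBDETS.get(subdet, "subdet=%d" % subdet)
    else:
        subdet_name = "subdet=%d" % subdet
    return det_name, subdet_name


# The two index values quoted in the exception messages.
for raw_id in (303079452, 353118212):
    print(raw_id, decode_det_id(raw_id))
```

Under these assumptions the first value decodes to a pixel barrel module and the second to a pixel endcap module, which would suggest the problem is not confined to one part of the pixel detector.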

The logs from F3 Mon are attached to the thread.

f3mon_logtable_2024-11-23T08_18_32.480Z.txt

f3mon_logtable_2024-11-23T08_18_18.602Z.txt

Once the error stream files are made available, we'll attempt to reproduce the crashes.

Cc:
@cms-sw/hlt-l2 @cms-sw/heterogeneous-l2 @trocino @vince502

@cmsbuild

cmsbuild commented Nov 23, 2024

cms-bot internal usage

@cmsbuild

A new Issue was created by @mmusich.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@missirol

Some of the error files from those runs can be found at

/eos/cms/store/group/tsg/FOG/error_stream_root/run388769
/eos/cms/store/group/tsg/FOG/error_stream_root/run388770

Below is a reproducer tested on lxplus800 with CMSSW_14_1_5_patch2 using one of those files.

#!/bin/bash

# cmsrel CMSSW_14_1_5_patch2
# cd CMSSW_14_1_5_patch2/src
# cmsenv

hltLabel=hlt
hltMenu=run:388769
globalTag=141X_dataRun3_HLT_v1

hltGetConfiguration \
  "${hltMenu}" \
  --globaltag "${globalTag}" \
  --data \
  --no-prescale \
  --no-output \
  --max-events 1 \
  --input root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run388769/run388769_ls0186_index000175_fu-c2b03-06-01_pid4137691.root \
  --path 'HLT_HIUPC_DoubleEG5_BptxAND_SinglePixelTrack_MaxPixelTrack_v*' \
  > "${hltLabel}".py

cat <<@EOF >> "${hltLabel}".py
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

del process.MessageLogger
process.load('FWCore.MessageLogger.MessageLogger_cfi')

process.source.skipEvents = cms.untracked.uint32( 90 )
@EOF

cmsRun "${hltLabel}".py &> "${hltLabel}".log
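To check whether the reproducer hit the same exception, the log can be scanned for the message shown at the top of this issue. This is only a sketch: the regular expression assumes the exact message layout quoted in this thread, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Pattern follows the message layout quoted in this issue;
# other releases may format the exception differently.
PATTERN = re.compile(
    r"Calling method for module (?P<type>\S+)/'(?P<label>[^']+)'\s*\n"
    r"Exception Message:\s*\n"
    r"DetSetVector::inserv called with index already in collection;\s*\n"
    r"index value: (?P<index>\d+)"
)


def find_duplicate_index_errors(log_text):
    """Return a (module label, index value) pair for each matching exception."""
    return [(m.group("label"), int(m.group("index")))
            for m in PATTERN.finditer(log_text)]


def summarize(log_path="hlt.log"):
    """Count occurrences per (module label, index value) in a cmsRun log file."""
    with open(log_path, errors="replace") as handle:
        return Counter(find_duplicate_index_errors(handle.read()))
```

Calling `summarize("hlt.log")` after the script above finishes should return a non-empty counter if the crash was reproduced.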

@mmusich

mmusich commented Nov 25, 2024

assign hlt, heterogeneous

@mmusich

mmusich commented Nov 25, 2024

@cms-sw/trk-dpg-l2 @ferencek @mroguljic FYI

@cmsbuild

New categories assigned: hlt,heterogeneous

@fwyzard,@makortel,@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich

mmusich commented Nov 25, 2024

type trk

@cmsbuild cmsbuild added the trk label Nov 25, 2024
@ferencek

The issue has been briefly discussed in the Tracker Operations and the Pixel Offline meetings this week with no definite conclusions at this point but the firmware upgrade was thought to be a possible reason for the observed HLT crashes. A general consensus was that the issue needs to be better understood from the firmware side before any attempts to fix the problem from the offline side, assuming that's the right place to fix it, are made.

@mmusich

mmusich commented Nov 27, 2024

A general consensus was that the issue needs to be better understood from the firmware side

That certainly needs to happen

before any attempts to fix the problem from the offline side, assuming that's the right place to fix it, are made.

I beg to differ. Either Pixel operations guarantees this particular firmware never gets uploaded again, or Tracker DPG puts in place a protection against corrupt data. Crashing the HLT is not an option.
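To make "a protection against corrupt data" concrete, here is a toy model of the failure and of one possible guard. It is not the CMSSW DetSetVector nor the actual Alpaka converter: it assumes, purely for illustration, that the filling code opens a new per-module entry whenever the module ID changes, so a hit that shows up after its module was already filled triggers a second insert for the same ID. All class and function names are hypothetical.

```python
class DuplicateIdError(Exception):
    """Raised by the toy container on a second insert of the same module ID."""


class ToyDetSetVector:
    """Toy stand-in: one list of hits per module ID, duplicate inserts raise."""

    def __init__(self):
        self._sets = {}

    def __contains__(self, det_id):
        return det_id in self._sets

    def insert(self, det_id):
        if det_id in self._sets:
            raise DuplicateIdError(
                "insert called with index already in collection; "
                "index value: %d" % det_id)
        self._sets[det_id] = []
        return self._sets[det_id]

    def as_dict(self):
        return {det_id: list(hits) for det_id, hits in self._sets.items()}


def fill_strict(hits):
    """Open a new entry on every module change (assumes hits grouped by module)."""
    out = ToyDetSetVector()
    current_id, current = None, None
    for det_id, hit in hits:
        if det_id != current_id:
            current = out.insert(det_id)  # raises if the module reappears
            current_id = det_id
        current.append(hit)
    return out


def fill_protected(hits):
    """Same as fill_strict, but set aside hits whose module was already filled."""
    out = ToyDetSetVector()
    skipped = []
    current_id, current = None, None
    for det_id, hit in hits:
        if det_id != current_id:
            if det_id in out:
                skipped.append((det_id, hit))  # misplaced hit: record, do not crash
                continue
            current = out.insert(det_id)
            current_id = det_id
        current.append(hit)
    return out, skipped
```

With the input `[(10, "a"), (10, "b"), (20, "c"), (10, "d"), (20, "e")]`, `fill_strict` raises on the fourth hit, while `fill_protected` keeps the well-ordered hits and returns the misplaced one separately so it can be counted or reported.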

@ferencek

We can try to implement a fix for this particular crash once we understand what really happened but there is no guarantee that this will safeguard the HLT from other possible failure modes. But yes, crashing the HLT is certainly not an acceptable mode of operation.

@dkotlins

dkotlins commented Dec 1, 2024

Marino thought it might be worthwhile to repost my recent observations from Mattermost.
I have done the following:

  1. I checked that the C++ code works for a misplaced channel in both the raw2digi and the clusterizer, so the problem is not in the container but in the GPU implementation.

  2. The same error (channel out of order) appears in one of the events provided by Marino that caused the HLT problems.

  3. In the C++ implementation the right module has one pixel hit fewer, and the misplaced hit appears in another module as a single-pixel cluster.

  4. It seems to me that the easiest way to proceed is to modify the raw2digi code. There it is easy to catch this case, since the channels are really out of order. We can either skip the spurious hit or flag it with an error. There is already an error code (35) foreseen for invalid channels (>48); we could use it for this case as well.

  5. Finally, I also checked the long run in the next fill, where the v19.2 firmware was used: it does not have a single case of this error. So I think it is clear that the problem is related to the new firmware.
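The check proposed in point 4 could look roughly like the sketch below. It is not the actual raw2digi code: it assumes that the hits of one FED arrive as (channel, payload) pairs and that valid channels come in non-decreasing order. Only the error code 35 and the channel limit 48 are taken from the comment above; the function name and the data layout are hypothetical.

```python
INVALID_CHANNEL_ERROR = 35  # error code already foreseen for invalid channels
MAX_CHANNEL = 48            # channels above this value are treated as invalid


def filter_fed_words(words):
    """Split the (channel, payload) pairs of one FED into hits and flagged errors.

    Assumption: within one FED, valid channels arrive in non-decreasing order.
    """
    hits, errors = [], []
    last_channel = 0
    for channel, payload in words:
        out_of_range = channel > MAX_CHANNEL
        out_of_order = channel < last_channel
        if out_of_range or out_of_order:
            # Flag the word instead of passing it on to the clusterizer.
            errors.append((INVALID_CHANNEL_ERROR, channel, payload))
            continue
        last_channel = channel
        hits.append((channel, payload))
    return hits, errors


words = [(1, "a"), (1, "b"), (2, "c"), (1, "d"), (3, "e"), (49, "f")]
hits, errors = filter_fed_words(words)
print(hits)
print(errors)
```

One caveat of this simple version: a single spuriously high channel would cause every later hit with a lower channel to be flagged, so deciding which hit is the misplaced one would need input from the firmware side.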
