Memory Jump from `14_1_0_pre5` for Phase2 Workflows #45854

AdrianoDee · 2024-09-02T05:47:11Z

In one of the latest rounds of RelVal production we observed increased memory usage in the RECO step for Phase2 (D110) PU workflows. We spotted this because many jobs started to fail due to excessive memory usage: we usually set the memory limit per job to 16GB (for 8-thread, 2-stream jobs), and we went above that threshold quite frequently. See e.g. the error report for a TTbar workflow (find below the memory reported by condor to have exceeded the maxPSS). For the equivalent workflow in pre4 we had just a few failures.

See also the PeakValueRss reported in the FrameworkJobReport.xml output for each job. I'm not sure how to interpret the low value tail for pre5.
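A minimal sketch of how the PeakValueRss values could be collected from the job reports. It assumes the value is stored as a `Metric` element with `Name="PeakValueRss"` and a `Value` attribute; the sample XML is a hypothetical, heavily trimmed report and the function name is only illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def peak_rss_mb(report_xml: str) -> Optional[float]:
    """Return the PeakValueRss value found in a job report, or None."""
    root = ET.fromstring(report_xml)
    # Assumption: the value is stored as <Metric Name="PeakValueRss" Value="..."/>.
    for metric in root.iter("Metric"):
        if metric.get("Name") == "PeakValueRss":
            return float(metric.get("Value"))
    return None


# Hypothetical, trimmed report used only to exercise the function.
sample = """
<FrameworkJobReport>
  <PerformanceReport>
    <PerformanceSummary Metric="ApplicationMemory">
      <Metric Name="PeakValueRss" Value="14329.5"/>
      <Metric Name="PeakValueVsize" Value="30748.9"/>
    </PerformanceSummary>
  </PerformanceReport>
</FrameworkJobReport>
"""
print(peak_rss_mb(sample))
```

Running this over every FrameworkJobReport.xml of the pre4 and pre5 productions would give the two distributions to compare.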

Bottom line: something happened between pre4 and pre5 that caused the memory usage at RECO step to jump quite a bit.

Reports

I've copied all the job reports and configs for TTbar PU=200 here or you can check in

/eos/cms/store/logs/prod/2024/06/WMAgent/pdmvserv_RVCMSSW_14_1_0_pre4TTbar_14TeV__STD_2026D110_PU_240604_235438_8449/
/eos/cms/store/logs/prod/2024/08/WMAgent/pdmvserv_RVCMSSW_14_1_0_pre5TTbar_14TeV__STD_2026D110_PU_240801_183220_4814/

We have "solved" this on our side by raising the memory limit to 20GB. The situation is unchanged in 14_1_0_pre6 (and 14_1_0_pre7), since we are still seeing failures if the memory max is set to 16GB.


AdrianoDee · 2024-09-02T05:48:02Z

assign reconstruction

AdrianoDee · 2024-09-02T05:48:05Z

assign upgrade

cmsbuild · 2024-09-02T05:48:27Z

New categories assigned: reconstruction,upgrade

@jfernan2,@mandrenguyen,@srimanob,@subirsarkar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2024-09-02T05:48:28Z

cms-bot internal usage

cmsbuild · 2024-09-02T05:48:29Z

A new Issue was created by @AdrianoDee.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel · 2024-09-03T14:25:23Z

@AdrianoDee What would be the way to reproduce these jobs in pre4 and pre5 (onwards)?

AdrianoDee · 2024-09-03T15:09:51Z

@makortel if you check this (or under /eos/user/a/adiflori/www/phase2_pre5_memory_issue/)

> I've copied all the job reports for TTbar PU=200 here

you will find the configs used by each job (so with the corresponding lumisections) as a PSet.py (loading a pickle file). E.g. in /eos/user/a/adiflori/www/phase2_pre5_memory_issue/14_1_0_pre5/0/job/WMTaskSpace/cmsRun1 there's everything needed to run it:

[adiflori@lxplus957 cmsRun1]$ ls -tlrh
total 67M
-rw-r--r--. 1 adiflori zh 419K Sep  2 01:15 FrameworkJobReport.xml
-rw-r--r--. 1 adiflori zh  128 Sep  2 01:15 PSet.py
-rw-r--r--. 1 adiflori zh  38M Sep  2 01:15 PSet.pkl
-rw-r--r--. 1 adiflori zh  70K Sep  2 01:15 cmsRun1-stdout.log
-rw-r--r--. 1 adiflori zh  54K Sep  2 01:15 Report.pkl
-rw-r--r--. 1 adiflori zh  227 Sep  2 01:15 cmsRun1-stderr.log
drwxr-xr-x. 2 adiflori zh 4.0K Sep  3 16:30 __pycache__
-rw-r--r--. 1 adiflori zh  29M Sep  3 16:30 dump.py

where dump.py was just me checking that the edmConfigDump output made sense (and it does).

AdrianoDee · 2024-09-03T15:10:52Z

(Made clearer in the description the link includes the job configs too)

jfernan2 · 2024-09-04T15:55:30Z

From the RECO side we have been looking at this increase in the RSS memory profile since the beginning[1] (indeed there is a drop of RSS to normal levels at the end of the tasks), though the igprof profile does not show any difference in live memory consumption[2][3].

We have looked at the memory allocated by individual Reco modules in pre4[4] and pre5[5], and only a general slight increase across all modules can be seen, with no single real culprit.

Our suspect from the beginning was the change of the test machines to AL9, since the pre4 > pre5 transition happened at almost the same time as the change of operating system, but we did not know whether this theory made sense since we did not know how RSS memory is managed. I have now discarded this hypothesis by running pre4 again on AL9.

[1] https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_12634.21.html
[2] https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/comp_igprof/html/CMSSW_14_1_0_pre5/12634.21/step3/mem_live.1.html
[3] https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/comp_igprof/html/CMSSW_14_1_0_pre5/12634.21/step3/mem_live.399.html

[4] https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/circles/piechart.php?local=false&dataset=CMSSW_14_1_0_pre4%2Fel8_amd64_gcc12%2F12634.21%2Fstep3_circles&resource=mem_alloc&colours=default&groups=reco_PhaseII_private&threshold=0

[5] https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/circles/piechart.php?local=false&dataset=CMSSW_14_1_0_pre5%2Fel8_amd64_gcc12%2F12634.21%2Fstep3_circles&resource=mem_alloc&colours=default&groups=reco_PhaseII_private&threshold=0

makortel · 2024-09-04T17:06:05Z

> Our suspect from the beginning was the change to AL9 of the test machines, since pre4>pre5 was at almost the same time as the transition of the operating system, but we did not know if this theory made sense since we did not know how RSS mem is managed.

We have seen reports or suspicions before of the EL8/EL9 host OS leading to more memory being used compared to SLC7: #42929, #45028

> I have discarded this hypothesis now by running pre4 again in AL9.

@jfernan2 Just to confirm I understood correctly, do you mean that you don't see a significant difference in memory usage between pre4 and pre5 when running on the same node or OS (AL9 I guess)?

@AdrianoDee Is there a significant difference in the sites or host OS versions on which the jobs were run between pre4, pre5, pre6, and pre7?

makortel · 2024-09-04T18:41:51Z

> I've copied all the job reports for TTbar PU=200 here

Using the setup from above, I processed the same 10 events of the pre4 input on an EL8 node using pre4 and pre5

pre4 resulted in peak RSS 14133.7 Mbytes and VSIZE 29524.8 Mbytes
pre5 resulted in peak RSS 14329.5 Mbytes (+1.4 %) and VSIZE 30748.9 Mbytes (+4.1 %)

The number of processed events was small, and real statistical conclusions would require more runs, but at least the numbers are not outrageously different.
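For reference, the relative differences quoted above follow directly from the four numbers; a small sketch of the arithmetic (the helper name is only illustrative):

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Peak values in Mbytes, as reported for the 10-event test on the EL8 node.
rss_pre4, rss_pre5 = 14133.7, 14329.5
vsize_pre4, vsize_pre5 = 29524.8, 30748.9

rss_delta = percent_change(rss_pre4, rss_pre5)
vsize_delta = percent_change(vsize_pre4, vsize_pre5)
print(f"peak RSS: {rss_delta:+.1f} %")
print(f"VSIZE:    {vsize_delta:+.1f} %")
```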

jfernan2 · 2024-09-05T09:45:13Z

> @jfernan2 Just to confirm I understood correctly, do you mean that you don't see a significant difference in memory usage between pre4 and pre5 when running on the same node or OS (AL9 I guess)?

My initial conclusion yesterday was that I was seeing NO difference between AL9 and older OS, based on this plot for wf11834.21:

https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_11834.21.html

where pre4, pre5 (and pre6) all run on AL9 and, except for the very first few events, RSS was lower for pre4 than for pre5 (and pre6).

However, my test on another cross-check workflow, wf12634 in pre4 on AL9, finished overnight, and things have changed:

https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_12634.21.html

pre4 is now at the RSS levels of pre5 (pre6 and pre7) for more than the first half of the events.

So it looks like there is some dependence on AL9 and the way it handles RSS.

I am running some extra tests on pre3 using AL9 (all releases except pre4, pre5, pre6 and pre7 in those plots were made before AL9 entered the game) and with a Phase2 wf25034.21 using pre4 on AL9 in order to confirm.

I also noticed that the trend in RSS over the job is very different in MC (high at the beginning and then dropping) wfs w.r.t. those in data (increasing over time), see e.g.
https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_136.889.html
https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_140.047.html

although the number of events in the task for data wfs is an order of magnitude higher than for mc wfs. I am also testing those again in pre4 using AL9 to have another independent view

AdrianoDee · 2024-09-06T08:35:15Z

On our side we submit all the RelVals with SCRAM_ARCH=el8_amd64_gcc12 so I suppose they all run in el8.

jfernan2 · 2024-09-06T09:00:17Z

That's the case too in our profiling tests but the machine says at the beginning of the log:
Building remotely on vocms011 (el9 GenuineIntel reco-profiling no_label cpu-32 amd64)

AdrianoDee · 2024-09-06T09:10:16Z

Ah, so there's a possible loophole; let me check more carefully.

jfernan2 · 2024-09-06T09:11:05Z

After my latest profiling tests in pre3 and pre4 with other MC and data workflows, I confirm that running on AL9 increases the peak RSS by 20-30% in the same job w.r.t. previous OS versions. The RSS increases at the beginning of the task/job and then drops for the last events. I will update the plots above with the latest tests later; you can find all the results in text form in the following folder (dates in Aug and Sep mean AL9):

ls -ld /eos/cms/store/user/cmsbuild/profiling/data/CMSSW_14_1_0_pre?/el8_amd64_gcc12/*/step3_TimeMemoryInfo.log | grep Sep

makortel · 2024-09-06T13:45:33Z

> On our side we submit all the RelVals with SCRAM_ARCH=el8_amd64_gcc12 so I suppose they all run in el8.

Well, yes and no. The jobs will use the el8 binaries, and thus require either el8 host OS or an el8 container, but in the container case, the host OS can be (nearly) anything (as far as scram and CMSSW are concerned).
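To illustrate the distinction, here is a minimal sketch that a job could run to log both views: the userland it sees (the container's, if there is one) and the kernel release, which is shared with the host and so often hints at the host OS. The function names are only illustrative, and whether the kernel string contains an `el8`/`el9` tag depends on the distribution.

```python
import platform
from pathlib import Path


def parse_os_release(text: str) -> dict:
    """Parse the KEY=value lines of an os-release file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def describe_environment() -> str:
    """Report the userland OS (container or host) next to the kernel release."""
    path = Path("/etc/os-release")
    userland = parse_os_release(path.read_text()) if path.exists() else {}
    name = userland.get("PRETTY_NAME", "unknown")
    # A container shares the kernel with its host, so this string describes the host.
    return f"userland: {name}; kernel: {platform.release()}"


print(describe_environment())
```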

srimanob · 2024-09-06T16:09:26Z

Hi,
Does #45854 (comment) explain the memory increase? If you don't spot anything in RECO, one thing to check is DQM. DQM runs in the same step as RECO, so I don't think we can separate the memory used by DQM from that used by RECO in RelVals. I see several DQM-related PRs added in 14_1_0_pre5. Just a guess in case we don't spot the issue in RECO.

jfernan2 · 2024-09-06T16:27:57Z

Hi @srimanob
The RECO profiling tests show that running in AL9 produces an increase of the RSS memory of 20-30% in the first part of the job, DQM part is excluded in these jobs.

srimanob · 2024-09-06T16:36:13Z

Right @jfernan2
I mean, if this 20% increase explains the issue, then we focus on it. If not, we need to look at something beyond RECO, and DQM may be the next target as it is not part of the profiling.

Do we know whether the jobs crash at the very beginning or sometime later?

jfernan2 · 2024-09-06T16:44:49Z

No, there is no job crash, just RSS memory is increased at the beginning of the job, see e.g. https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_11834.21.html

makortel · 2024-09-06T16:51:18Z

> No, there is no job crash, just RSS memory is increased at the beginning of the job, see e.g. https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_11834.21.html

I wonder how reproducible this RSS behavior is with respect to the job progress (in terms of event number). I.e. what would the RSS vs. "time" plot look like if you ran the same job (e.g. pre5) several times on the same node?

jfernan2 · 2024-09-06T17:16:29Z

I can't give you an exact answer, but based on the graphs I sent you for several pre-releases, I would expect fluctuations around the mean value like those shown in the first plot on the left of https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_140.047.html

of the order of 400MB around a mean central value. What is probably more significant is the peak RSS value in the left column of the table in the link above.

srimanob · 2024-09-07T16:07:32Z

Hi @AdrianoDee

Looking at the absolute memory we use in pre5 on Alma9, I tried running as in RelVals (100 jobs, 100 events/job), and what I observe is that the RSS is quite stable, a little below 15 GB. RSS is typically larger than PSS, so I don't understand why we have such a large PSS; it would mean the RSS is equal or even larger.

As a current action, 1 stream should allow you to stay within 16 GB PSS. I measured that as well.

8 Threads, 2 Streams

8 Threads, 1 Stream
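On the RSS vs PSS point: PSS counts each shared page divided by the number of processes sharing it, so for a given process PSS cannot exceed RSS. A minimal sketch of reading both from a `/proc/<pid>/smaps_rollup` dump follows; the sample text and its numbers are hypothetical and only mimic the format.

```python
def parse_smaps_rollup(text: str) -> dict:
    """Return the kB values of the Rss and Pss lines of a smaps_rollup dump."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("Rss:", "Pss:"):
            values[fields[0].rstrip(":")] = int(fields[1])
    return values


# Hypothetical excerpt in the format of /proc/<pid>/smaps_rollup.
sample = """\
55d0a0000000-7ffd4a1fe000 ---p 00000000 00:00 0    [rollup]
Rss:            15012345 kB
Pss:            14876543 kB
Shared_Clean:     201234 kB
"""
mem = parse_smaps_rollup(sample)
print(f"RSS {mem['Rss'] / 1024**2:.2f} GiB, PSS {mem['Pss'] / 1024**2:.2f} GiB")
```

Sampling this periodically for the cmsRun process would give RSS and PSS from the same source, which makes them directly comparable with the maxPSS limit.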

jfernan2 · 2024-09-10T08:16:01Z

My latest test, repeating pre3 on AL9, shows high RSS for the whole job/task, from event 1 to 400:
https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_11834.21.html

Same for data wfs on pre4 (on AL9):
https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_136.889.html
https://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_14_1_step3_140.047.html

makortel · 2024-09-10T14:00:47Z

Would someone be able to compare (again) the behavior of cmsRun (jemalloc), cmsRunTC (TCMalloc), and cmsRunGlibC (glibc malloc)?
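One possible way to set up such a comparison, sketched under assumptions: each executable is wrapped in GNU `time -v`, and the peak RSS is read from its "Maximum resident set size" line. The config name PSet.py is taken from the job directory listed earlier; the helper names are illustrative, and nothing here actually launches a job.

```python
import re

# Executable names as given in the thread, keyed by the allocator they use.
ALLOCATOR_BUILDS = {
    "jemalloc": "cmsRun",
    "TCMalloc": "cmsRunTC",
    "glibc malloc": "cmsRunGlibC",
}


def timed_command(executable: str, config: str = "PSet.py") -> list:
    """Wrap the job in GNU time so the peak RSS is printed at exit."""
    return ["/usr/bin/time", "-v", executable, config]


def max_rss_mb(time_output: str):
    """Extract the peak RSS (in MB) from `/usr/bin/time -v` output, or None."""
    match = re.search(r"Maximum resident set size \(kbytes\): (\d+)", time_output)
    return int(match.group(1)) / 1024 if match else None


for allocator, exe in ALLOCATOR_BUILDS.items():
    print(f"{allocator}: {' '.join(timed_command(exe))}")
```

Each command would have to be run on the same node and input for the numbers to be comparable.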

jfernan2 · 2024-09-12T08:33:46Z

For the peak RSS I also have this plot from before the inclusion of AL9. I do not know whether the drop in 13_3_X was due to AL8; I don't recall the dates of the previous Linux version.
https://gitlab.cern.ch/cms-reconstruction/cms-reco-profiling/-/blob/main/results/release_mem_run3_8thread.png?ref_type=heads

AdrianoDee · 2024-10-11T11:06:17Z

Hi, we see failures even when running

- 8 threads 1 stream with 16GB. See e.g. this job and its logs;
- 8 threads 2 streams with 20GB. See e.g. this job and its logs.

AdrianoDee · 2024-10-11T11:25:40Z

Another piece of information that seems interesting to me. When running exactly the same wf, with exactly the same settings, with SCRAM_ARCH=el9_amd64_gcc12 the situation improves. I've run in 14_1_0_pre7 a TTbar Phase2 PU200 wf with the arch set to:

- the default el8_amd64_gcc12, and we see 30% of the jobs failing for exceeding maxPSS;
- el9_amd64_gcc12, and everything is fine.

So it seems to be something related to the OS (and to the fact that we presumably run in containers).

AdrianoDee · 2024-10-11T11:26:53Z

For the moment the only stable and successful setup I found is 8 threads 1 stream and 20GB.

makortel · 2024-10-11T13:25:15Z

> the default el8_amd64_gcc12 and we see 30% of failures for exceeded maxPSS.
>
> el9_amd64_gcc12 and everything is fine.

Were the jobs run on the same nodes (or the same underlying OS)?

makortel · 2024-10-11T14:06:27Z

> 8 threads 1 stream with 16GB. See e.g. this job and its logs;
>
> 8 threads 2 streams with 20GB. See e.g. this job and its logs;

Plotting the RSS and VSIZE from these results

I think we'd need a serious analysis of where the memory is being used. The 70 GB VSIZE is scary (especially given what we've learned of the OS-dependent behavior of RSS elsewhere, e.g. in #46040 or #42387).

One sort-of obvious memory hog is the (playback of) classical mixing.

dan131riley · 2024-10-11T14:16:34Z

Playback has some serious inefficiencies if the upstream PU producer is multi-threaded/multi-streamed, but I wouldn't expect that to cause memory explosion issues. It might be interesting to run the whole chain with one stream for each step (I believe there should only be one pileup input per stream, so the number of threads shouldn't matter).

Playback could be made much more time and IO efficient with some caching in the embedded file input module, but wouldn't expect a large memory impact.

AdrianoDee · 2024-10-11T14:39:51Z

> Were the jobs run on the same nodes (or the same underlying OS)?

Should be el9 for both. I say "should" because I can't find any evidence in the logs of what the underlying OS is. I need to check properly.

For the playback, note that we see similar failures even when the job is (wrongly) submitted with no PU replay at the RECO step (as in this case, for example, with 20GB, 2 streams and 8 cores).

AdrianoDee · 2024-10-11T14:41:38Z

I've resubmitted a couple of wfs excluding the VALIDATION step, since in the past there have been issues there and the prod-like wfs used for the reco-timings don't run it and don't show such a jump.

AdrianoDee · 2024-10-16T10:00:07Z

One further piece. The wf with no VALIDATION or DQM sent with 16GB and 2 streams was successful. As a comparison:

- the standard job (-s RAW2DIGI,RECO,RECOSIM,PAT,VALIDATION:@phase2Validation+@miniAODValidation,DQM:@phase2+@miniAODDQM) with 20GB and 2 streams, which failed here
- the no-validation one (-s RAW2DIGI,RECO,RECOSIM,PAT) with 16GB and 2 streams, here

From the logs I see that the VSIZE and RSS are much more under control (the no-validation job ran on twice as many events, given that it had no failures).
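The two step strings differ only by the VALIDATION and DQM entries. A small sketch of deriving the reduced string from the standard one (the helper name is only illustrative):

```python
def drop_steps(steps: str, unwanted=("VALIDATION", "DQM")) -> str:
    """Remove whole steps (with their ':' arguments) from a comma-separated -s string."""
    kept = [s for s in steps.split(",") if s.split(":", 1)[0] not in unwanted]
    return ",".join(kept)


standard = ("RAW2DIGI,RECO,RECOSIM,PAT,"
            "VALIDATION:@phase2Validation+@miniAODValidation,"
            "DQM:@phase2+@miniAODDQM")
print(drop_steps(standard))
```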

P.S. let me thank @fabiocos for suggesting the source could be there ( :

AdrianoDee · 2024-10-16T10:14:52Z

assign dqm

cmsbuild · 2024-10-16T10:15:11Z

New categories assigned: dqm

@antoniovagnerini,@nothingface0,@rvenditti,@syuvivida,@tjavaid you have been requested to review this Pull request/Issue and eventually sign? Thanks

srimanob · 2024-10-17T00:06:13Z

Hi @AdrianoDee
It seems I should have bet more last month :)

Thanks for checking. In the validation jobs, you include the PU replay by default for VALIDATION + DQM, right?
Note also #38828, in case we run something unnecessary twice.

AdrianoDee · 2024-10-17T05:02:11Z

Yes, the replay is included

AdrianoDee · 2024-10-17T05:48:18Z

Ah, but note that we have the same memory "explosion" even when it is not included.


makortel · 2024-10-24T19:00:29Z

I profiled the example job (I think it was this one)

> 8 threads 2 streams with 20GB. See e.g. this job and its logs;

with IgProf, and below is the MEM_LIVE after the first event (the full profile is here).

1410 MB in EventProcessor constructor
- 547 MB in Source constructor (link)
  - 480 MB in TFile::Open() (link)
    - 479 MB in TFile::ReadStreamerInfo() (link)
      - Mostly from TClass::GetClass()
- 488 MB in EDModule construction (link) after subtracting non-constructor pieces
  - 181 MB in cut/expression parser (link)
    - 85 MB via TopSingleLeptonDQM_miniAOD (link)
    - 33 MB via TopSingleLeptonDQM (link)
    - 22 MB via SingleObjectSelectorBase<edm::View<reco::GsfElectron>, StringCutObjectSelector<reco::GsfElectron, false>, edm::stream::EDFilter<>, ...> (link)
    - 15 MB via SingleObjectSelectorBase<edm::View<reco::Muon>, StringCutObjectSelector<reco::Muon, false>, edm::stream::EDFilter<>, ...> (link)
    - 11 MB via RecoTauPiZeroProducer (link)
    - 5.1 MB via SingleObjectSelectorBase<std::vector<reco::GenJet>, StringCutObjectSelector<reco::GenJet, false>, edm::stream::EDFilter<>, ...> (link)
  - 94 MB in ONNX (link)
    - 45 MB via BoostedJetONNXJetTagsProducer
    - 28 MB via UnifiedParticleTransformerAK4ONNXJetTagsProducer
    - 17 MB via pat::MuonMvaIDEstimator, used by pat::PATMuonProducer
    - 4.9 MB via DeepFlavourONNXJetTagsProducer
  - 19+44=63 MB in Tensorflow graphs and sessions
    - 17+35=52 MB in DeepTauId
    - 1.7+3.5=5.2 MB in TfGraphDefProducer
  - 87 MB in GBRForest (link)
    - 29 MB via LowPtGsfElectronSeedProducer
    - 22 MB via MVAValueMapProducer<reco::GsfElectron>
    - 14 MB via LowPtGsfElectronIDProducer
  - 18 MB in DeepTauId (link)
  - 8.3 MB in MuonDTDigis (link)
    - Mostly in ROOT histograms, I guess they get replicated among streams
  - 3.8 MB in ConvertedPhotonProducer (link)
  - 3.0 MB in MuonIdProducer (link)
- 150 MB in ProductRegistry (link)
  - Mostly from ROOT dictionaries
- 75 MB in PSet registry (link)
7240 MB in data processing (edm::EventProcessor::runToCompletion(), link)
- 4010 MB in event processing (link)
  - 1540 MB in MixingModule (link)
    - 1532 MB in CrossingFrame<T> data products
      - 495 MB in CrossingFrame<PCaloHit> (link 1, link 2)
      - 425 MB in CrossingFrame<PSimHit> (link 1, link 2)
      - 284 MB in CrossingFrame<edm::HepMCProduct> (link 1, link 2)
      - 262 MB in CrossingFrame<SimTrack> (link 1, link 2)
      - 66 MB in CrossingFrame<SimVertex> (link 1, link 2)
  - 1190 MB in PoolOutputModule (link)
  - 186 MB in PFRecHitProducer::produce() (link); see #46511, "[Phase2 RelVal] O(100 MB) memory held per event from HGCalGeometry::getGeometry() in PFRecHits"
    - 133 MB in HGCalGeometry::getGeometry() (link)
  - 159 MB in tensorflow::run() (link)
    - 102 MB via ~~TrackstersProducer~~ TrackstersMergeProducer
      - I think the stack trace generation got confused by the TrackstersMergeProducer::energyRegressionAndID() that actually calls tensorflow::run(), and ticl::PatternRecognitionbyCLUE3D<T>::energyRegressionAndID() which is declared but not defined
    - 39 MB via DeepTauId after subtracting the contribution of DeepTauId constructor
  - 85 MB in RecoTauProducer::produce() (link)
    - This is mostly vector<PFTau> produced by the module
  - 44 MB in TSToSimTSHitLCAssociatorEDProducer::produce() (link)
  - 36 MB in PFClusterProducer::produce() (link)
  - 34 MB in TrackListMerger::produce() (link)
  - 33 MB in SiPixelClusterProducer::produce() (link)
  - 30 MB in HGCalRawToDigiFake::produce() (link)
  - 25 MB in SeedCreatorFromRegionHitsEDProducerT<SeedFromConsecutiveHitsCreator>::produce() (link)
  - 24 MB in HGCalUncalibRecHitProducer::produce() (link)
  - 22 MB in PuppiProducer::produce() (link)
  - 22 MB in edm::FwdPtrCollectionFilter<reco::PFCandidate, reco::PdgIdSelectorHandler, reco::PFCandidateWithSrcPtrFactory>::filter() (link)
  - 21 MB in ClusterTPAssociationProducer::produce() (link)
  - 21 MB in PFTrackProducer::produce() (link)
- 1490 MB in edm::DelayedReaderInputProductResolver::prefetchAsync_() (link)
- 722 MB in EventSetup (link), after subtraction of all other components that use SerialTaskQueue
  - 198 MB in HGCalGeometryESProducer::produce() (link)
  - 69 MB in magneticfield::VolumeBasedMagneticFieldESProducerFromDB::produce() (link)
  - 53 MB in SiPixelTemplateStoreESProducer::produce() (link)
  - 50 MB in GBRForestD via CondDB (link)
  - 42 MB in EcalCondObjectContainer<EcalPulseCovariance> via CondDB (link)
- 830 MB in beginRun (global and one, stream)
  - DQM 825 MB
    - 548 MB as edm::stream (link)
      - 71 MB in Phase2TrackerMonitorDigi (link)
      - 56 MB in Phase2OTMonitorCluster (link)
      - 55 MB in Phase2ITMonitorCluster (link)
      - 30 MB in PrimaryVertexAnalyzer4PUSlimmed (link)
    - 192 MB as edm::global (link)
      - 126 MB in MultiTrackValidator (link)
      - 64 MB in HGCalValidator (link)
    - 86 MB as edm::one (link)
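As a quick cross-check of the breakdown above, the top-level components of the data-processing part and the CrossingFrame<T> entries can be summed. This is only a sketch; the numbers are copied from the list and the labels are shortened.

```python
# MEM_LIVE after the first event, in MB, copied from the breakdown above.
data_processing_total = 7240
components = {
    "event processing": 4010,
    "DelayedReader prefetch": 1490,
    "EventSetup": 722,
    "beginRun (mostly DQM)": 830,
}
crossing_frames = {"PCaloHit": 495, "PSimHit": 425, "HepMCProduct": 284,
                   "SimTrack": 262, "SimVertex": 66}

listed = sum(components.values())
for name, mb in components.items():
    print(f"{name:24s} {mb:5d} MB  {100 * mb / data_processing_total:5.1f} %")
print(f"not itemised: {data_processing_total - listed} MB")
print(f"CrossingFrame<T> total: {sum(crossing_frames.values())} MB")
```

The CrossingFrame<T> entries add up to the quoted 1532 MB, and the four itemised components account for all but a small part of the 7240 MB.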

makortel · 2024-10-24T19:02:15Z

I also looked at the number of memory allocations after 2 events (full profile here).

425 M allocations in event processing (link)
- 78.8 M in TSToSimTSHitLCAssociatorEDProducer::produce() (link)
  - 78.4 M in TSToSimTSHitLCAssociatorByEnergyScoreImpl::makeConnections() (link)
- 45.6 M in cms::CkfTrackCandidateMakerBase::produceBase() (link)
- 27.1 M in Phase2TrackerMonitorDigi::analyze() (link); see #46510, "[Phase2 RelVal] O(10 million) memory allocations per event by Phase2TrackerMonitorDigi::analyze()"
  - Nearly all in MessageLogger
- 26.7 M in HGCalValidator::dqmAnalyze() (link)
  - 16.9 M in HGVHistoProducerAlgo::tracksters_to_SimTracksters() (link)
- 18.8 M in edm::BMixingModule::produce() (link)
  - Nearly all is via delayed reader
- 17.2 M in MuonIdProducer::produce() (link)
- 15.1 M in RecoTauProducer::produce() (link)
- 12.2 M in MtdRecoClusterToSimLayerClusterAssociatorEDProducer::produce() (link)
- 10.7 M in LowPtGsfElectronSeedProducer::produce() (link)
- 9.97 M in ClusterTPAssociationProducer::produce() (link)
43.0 M allocations in beginRun (one and global, stream)
- 18.6 M in cscdqm::Dispatcher::book() (link)
35.6 M in edm::ProductSelector::initialize() (link)
- Nearly all in std::regex
28.7 M in module construction (link)
12.4 M allocations in MTDGeometricTimingDetESModule::produce() (link); see #46512, "[Phase2 RelVal] ~12 million allocations from MTDGeometricTimingDetESModule::produce()"
10.5 M in PoolSource construction (link)

makortel · 2024-10-24T19:12:27Z

I think it is pretty clear that the sheer volume of data products is the main cause of the memory problems: ~3 GB of produced data products (of which 1.5 GB is the CrossingFrame<T> alone) plus ~1.5 GB for reading in data, and these scale with the number of streams.

The two output modules taking ~1 GB at this stage (I'm sure this would become larger later in the job) and the 830 MB in DQM histograms don't help either, even if increasing the number of threads amortizes their cost. The size of the ROOT dictionaries is also notable (more than 500 MB or so).

The memory churn is substantial: 212 million allocations per event. Taking into account the average time per event from the logs, though, this would correspond to about a 1 MHz allocation rate per stream, which is about the same as in Run 3 prompt reco #46040 (comment).
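The arithmetic behind these two numbers can be sketched as follows. The per-event time is not quoted in the thread, so the value used below is a placeholder, chosen only because it is what the quoted ~1 MHz rate would imply.

```python
allocations_after_two_events = 425e6  # event processing, from the profile above
events = 2
allocs_per_event = allocations_after_two_events / events


def allocation_rate_hz(allocs_per_event: float, seconds_per_event: float) -> float:
    """Allocations per second for one stream processing one event at a time."""
    return allocs_per_event / seconds_per_event


# Placeholder (assumption): roughly 200 s per event would match the ~1 MHz figure.
seconds_per_event = 200.0
rate = allocation_rate_hz(allocs_per_event, seconds_per_event)
print(f"{allocs_per_event / 1e6:.1f} M allocations per event")
print(f"{rate / 1e6:.2f} MHz per stream")
```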

makortel · 2024-10-24T19:20:37Z

> 35.6 M in edm::ProductSelector::initialize() (link)
>
> Nearly all in std::regex

So FEVTDEBUGHLToutput.outputCommands has 988 elements (starting with 5 drop *, among other duplication that I didn't look into deeper), and MINIAODSIMoutput.outputCommands has 119.
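A minimal sketch of how the duplication in such a list could be counted. The command list below is hypothetical and much shorter than the real one; only the repeated `drop *` is taken from the observation above.

```python
from collections import Counter


def duplicated_commands(output_commands):
    """Return {command: count} for commands that appear more than once."""
    # Collapse runs of whitespace so "drop  *" and "drop *" compare equal.
    normalised = [" ".join(cmd.split()) for cmd in output_commands]
    return {cmd: n for cmd, n in Counter(normalised).items() if n > 1}


# Hypothetical, shortened list; the real FEVTDEBUGHLToutput list has 988 entries.
commands = ["drop *", "drop *", "keep *_simSiPixelDigis_*_*", "drop  *",
            "keep *_mix_*_*", "keep *_simSiPixelDigis_*_*"]
print(duplicated_commands(commands))
```

Since every command is turned into a regular expression, each duplicate adds to the std::regex allocations seen in the profile.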

makortel · 2024-10-24T21:50:10Z

> 27.1 M in Phase2TrackerMonitorDigi::analyze() (link)
>
> Nearly all in MessageLogger

Spun off to #46510

makortel · 2024-10-24T22:02:26Z

> 186 MB in PFRecHitProducer::produce() (link)
>
> 133 MB in HGCalGeometry::getGeometry() (link)

Spun off to #46511

makortel · 2024-10-24T22:09:31Z

assign simulation

Because of the huge cost of the MC truth for VALIDATION (although any discussion of improvements would probably be better held in a separate issue).

cmsbuild · 2024-10-24T22:09:47Z

New categories assigned: simulation

@civanch,@kpedro88,@mdhildreth you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel · 2024-10-24T22:26:48Z

> 12.4 M allocations in MTDGeometricTimingDetESModule::produce() (link)

Spun off to #46512

cmsbuild added reconstruction-pending pending-signatures upgrade-pending labels Sep 2, 2024

makortel mentioned this issue Sep 9, 2024

Memory corruption with AllocMonitors #45964

Closed

AdrianoDee mentioned this issue Oct 14, 2024

Tracking validation updates/improvements #46324

Merged

cmsbuild added the dqm-pending label Oct 16, 2024

makortel mentioned this issue Oct 24, 2024

[Phase2 RelVal] O(10 million) memory allocations per event by Phase2TrackerMonitorDigi::analyze() #46510

Closed

makortel mentioned this issue Oct 24, 2024

[Phase2 RelVal] O(100 MB) memory held per event from HGCalGeometry::getGeometry() in PFRecHits #46511

Open

cmsbuild added the simulation-pending label Oct 24, 2024

makortel mentioned this issue Oct 24, 2024

[Phase2 RelVal] ~12 million allocations from MTDGeometricTimingDetESModule::produce() #46512

Closed
