
potential speedup in HLT: AXOL1TLCondition::evaluateCondition consider caching the model? #46740

Open · slava77 opened this issue Nov 19, 2024 · 16 comments

@slava77 (Contributor) commented Nov 19, 2024

Looking at a callgrind profile of HLT in 14_1_X, on MC running only the MC_ReducedIterativeTracking_v22 path, I see that 78% of L1TGlobalProducer::produce is spent in l1t::AXOL1TLCondition::evaluateCondition:
https://github.com/cms-sw/cmssw/blob/CMSSW_14_1_0_pre5/L1Trigger/L1TGlobal/src/AXOL1TLCondition.cc#L100

In my test, L1TGlobalProducer::produce takes 9.7% of the time; in the full menu it's apparently around 0.9% (updated from the initial 2.5% estimate; see notes below).

Of all the time spent in l1t::AXOL1TLCondition::evaluateCondition:

  • hls4mlEmulator::ModelLoader::load_model() is 54%
  • hls4mlEmulator::ModelLoader::~ModelLoader() is 30%
  • GTADModel_emulator_v4::predict() is 15%
  • the rest is around 1%

IIUC, the load and destruction of the model happen 10 times per event; I don't see any dependence on the current event's variables.
Some kind of caching may be useful to make HLT run a bit faster (seems like 0.6% or so, updated from the initial 1.5% estimate).


@cmsbuild (Contributor)

A new Issue was created by @slava77.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich (Contributor) commented Nov 19, 2024

@aloeliger @artlbv FYI

@aloeliger (Contributor)

@quinnanm

@aloeliger (Contributor)

I can also take a look at this when I get some time; frankly, I've been meaning to for a while.

@mmusich (Contributor) commented Nov 19, 2024

> In my test L1TGlobalProducer::produce takes 9.7% of the time; in the full menu it's apparently around 2.5%

In a more "realistic" setup it's more like sub-percent:

(screenshot attached in the original comment: timing measurement from 2024-11-19)

(details from #45631 (comment))

@makortel (Contributor)

assign l1

@makortel (Contributor)

type performance-improvements

@cmsbuild (Contributor)

New categories assigned: l1

@aloeliger, @epalencia you have been requested to review this Pull request/Issue and eventually sign? Thanks

@slava77 (Contributor, Author) commented Nov 19, 2024

> in a more "realistic" setup it's more like sub-percent

My 2.5% comes from a more recent test (but on the GRun menu) by @bdanzi.
The odd part there is that, e.g., in the first iteration L1TGlobalProducer takes under 1% in the first 6 jobs, but around 6% in the last two.

@mmusich (Contributor) commented Nov 19, 2024

> My 2.5% comes from a more recent test (but on GRun menu) by @bdanzi

I'd rather trust the manual measurement than what comes from the timing server, tbh.

@slava77 (Contributor, Author) commented Nov 20, 2024

Essentially all of the cost is in the dlopen and dlclose calls:
https://github.com/cms-hls4ml/hls4mlEmulatorExtras/blob/v1.1.3/src/hls4ml/emulator.cc#L12-L22
create_model has an orders-of-magnitude lower cost.

Perhaps just keeping the .so file loaded from the constructor or beginJob is enough.

@artlbv (Contributor) commented Nov 21, 2024

> IIUC, the load and destruction of the model happens 10 times per event; I don't see any dependence on the current event variables.

The "problem" is that the model is evaluated (and loaded) for every threshold and BX separately, even though it is the same model. If caching is possible that would help of course and in principle the model(s) are known at the beginning of the job as they are fixed in the L1 menu.

IMO it would be good to have a common approach to loading HLS4ML models within L1/CMSSW, since e.g. here it seems to be done differently:
https://github.com/cms-sw/cmssw/blob/CMSSW_14_2_X/L1Trigger/Phase2L1ParticleFlow/interface/JetId.h#L50-L53

@makortel (Contributor)

The "problem" is that the model is evaluated (and loaded) for every threshold and BX separately, even though it is the same model. If caching is possible that would help of course and in principle the model(s) are known at the beginning of the job as they are fixed in the L1 menu.

Given that the interface of hls4mlEmulator::Model is not const, the best one can do is to call ModelLoader::load_model() once per module per stream (in the constructor of an edm::stream module, using a StreamCache with an edm::global module, or in the constructor of an edm::one module).

This is how

    unique_ptr<float[]> fDY_;
    tensorflow::Session *sessionRef_;
    std::shared_ptr<hls4mlEmulator::Model> modelRef_;
    };

is used in

    fBJetId_ = std::make_unique<JetId>(
        cfg.getParameter<std::string>("NNInput"), cfg.getParameter<std::string>("NNOutput"), cache, fNParticles_);

(Now, if the hls4mlEmulator::Model interface were const and thread-efficient, we could avoid the per-stream copies of hls4mlEmulator::Model.)

@mmusich (Contributor) commented Nov 22, 2024

assign hlt

  • to keep it on the radar

@cmsbuild (Contributor)

New categories assigned: hlt

@Martin-Grunewald, @mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks
