DIVBYZERO and INVALID FPEs in CMS tests (pp_dy012j.mad in P2_uc_epemuc/G2) #942
@choij1589 I imagine that you probably have this in the runcard
Just try removing the -O3 and see if there is a difference with and without it. Let me know also if it goes much slower without -O3... (it should not, if you only do this in the cuda or cpp runs, but we never know)
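For reference, a rough sketch of what such an (optional) entry looks like in run_card.dat; the value shown is only illustrative and not necessarily the default of any particular MG5aMC version:

```
 -O3 = global_flag ! compiler optimisation flags for the build; try -O1 instead of -O3 here as a test
```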
In my run_card.dat, there is no global_flag assigned, so it's using the default one
I have tried both (with and without -O3).
Thanks Jin! Can you check if you have a GLOBAL_FLAG in Source/make_opts anyway please?
Thanks Jin, I am having a look. One point, Jin (also for Sapta): are you aware of ccache in the cudacpp? I saw that DY012j has now been running builds for more than 1h; it might take 3h, so I do not even want to imagine DY+4jets... I imagine that normally you will run these builds only once, or at least only once per software version, but I can very much imagine that sometimes you will rerun the same test with the same build. In that case, using ccache helps enormously. You should have ccache installed, export USECCACHE=1, and have CCACHE_DIR point to a directory that contains your build caches. I never use AFS or EOS because (with or without ccache) builds are horrible there, so my CCACHE_DIR is a local disk on my machine. I never managed to configure this with a network CCACHE, but in principle that is also possible. Maybe one thing to investigate for your gen productions.
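A rough sketch of the setup described above (the cache path is a placeholder; USECCACHE and CCACHE_DIR are the variables mentioned here):

```bash
# ccache setup sketch for repeated cudacpp builds (paths are placeholders)
export USECCACHE=1                             # ask the cudacpp makefiles to compile through ccache
export CCACHE_DIR=/local/scratch/$USER/ccache  # keep the cache on a local disk, not AFS/EOS
mkdir -p "$CCACHE_DIR"
ccache -M 20G                                  # optionally enlarge the cache for big processes like DY+jets
ccache -s                                      # show hit/miss statistics after a rebuild
```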
Thanks Jin again. A couple more questions
Thanks
…ebug madgraph5#942 (it takes 45s to generate!)
Thanks @valassi. Regarding the FPE warning: there are no such warnings in fortran, it only happens in CUDA. If I comment out the fpeEnable() function, then it gives 'FPE_xx' warnings, which are triggered from MatrixElementKernel.cc; otherwise it crashes at runtime. Regarding the subprocess, this happens in most of the subdirectories from DY+0j to DY+4j. I have quickly checked with the u u~ > e+ e- process and it also generated the same warning. Here is the input_app.txt for P1_uux_epem:
Regarding using cached directories: I am currently using a condor environment for the test productions, so I would need to build the processes somewhere (perhaps the EOS area) and xrdcp them through condor's own disk... but a first build on EOS from a local node (lxplus8-gpu) would be painful due to interruptions from other users :(. Still, I believe it would be a good option to check and I will test it out. Thanks!
Thanks Jin, ok I understand this is not very usable now. But let's keep this in mind. I opened #954.
Thanks Jin! Ok so I understand these are warnings and not crashes, only because you modified the code and disabled the FPEs... On my side I am redoing some tests of DY+2jets. I did see something that looks very much like your FPE issues
I actually get very many FPEs. I was chaining several backends (fortran, cpp, cuda), so I am not 100% sure where it happened, but I imagine it was cuda.
Ok here is a reproducer
And through gdb
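As a generic sketch (not the exact session used for this report; the executable name and input path are placeholders), trapping the FPE under gdb looks roughly like this:

```
$ gdb --args ./madevent
(gdb) run < ../input_app.txt   # with FPE traps enabled, the run stops on SIGFPE at the faulty operation
(gdb) backtrace                # inspect the fortran/c++ frames around the divide-by-zero
(gdb) info locals              # check the operands (e.g. a zero ALL_PD / PDF value)
```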
It seems that the issue happens both with -O3 and with -O, -O1, or without any -O flag... (on the fortran side)
Poor man's debugging
gives
In other words: this does not look like a SIMD issue (so it makes sense that -O3 or -O1 makes no difference). It seems that there is a real issue: one event has 0 weight? Why only in cpp and not in fortran?...
Oops... also in fortran it is 0, but then why does it not crash?
In principle All_PD(0, ivec) cannot be zero... Now clearly, given your update, I'm wrong somewhere...
Thanks @oliviermattelaer I was going to ask you in fact ;-) More debugging
The variable IS 0 both in cpp and fortran. The former prints T and crashes, the latter prints T and does not crash
I imagine that the difference in behaviour (crash or no crash) is just because in cpp there is an explicit call to fpeEnable that makes divbyzero fatal. Now, why this is 0 is probably for Olivier to debug... What I will look at in the meantime is whether there is a possible workaround to skip some computations if that value is zero (i.e. cure the symptoms, not the root cause).
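As an illustrative sketch only (the real fpeEnable in MatrixElementKernel.cc may differ), this is the general technique: enabling floating-point exception traps turns a silent division by zero (inf) into a fatal SIGFPE, which is why the cpp run crashes while the fortran run, without traps, continues.

```cpp
#include <cfenv>   // feenableexcept (glibc extension, available with g++ on Linux)
#include <cstdio>

// Sketch of FPE trapping: once traps are enabled, DIVBYZERO/INVALID/OVERFLOW raise SIGFPE
static void fpeEnableSketch()
{
#ifdef __GLIBC__
  feenableexcept( FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW ); // trap the FPEs discussed in this issue
#endif
}

int main()
{
  fpeEnableSketch();
  volatile double allpd = 0;          // stand-in for the zero ALL_PD value seen above
  std::printf( "%f\n", 1. / allpd );  // dies with SIGFPE (divbyzero) when traps are enabled
  return 0;
}
```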
I would say that it is pointless to cure the symptoms
Hi yes I had a look and I agree :-) Indeed the crash is BEFORE the ME calculation. So what is 0 is not the ME, it is the PDF?! The ALL_PD is a product of U1 and C2 which are the PDFs. So this indicates a zero pdf, which clearly looks like a bug...
Looks like my fortran code is not the same as yours (I'm in the middle of a merge, which is likely the cause, but it also means that such merging is super problematic), but your fortran code makes more sense than mine (and corresponds to my explanation above). (Sorry, but before focusing on this I will need to fix the merging, one issue at a time.) Now, given that "ALL_PD" is a local variable, the only possible explanation is indeed that it is zero because of the PDF call. This likely means that the momenta are "bad"; the question is obviously why. One option would be the black-box vetoing of events; that is possible here, but I doubt it is what is happening (it would likely show up as REWGT set to zero). The second hypothesis would be a reset of the code after 10 events; those can occur, but typically only after the call to the matrix element... Can you try to set this in the run_card: False = mc_grouped_subproc
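For reference, run_card.dat entries follow the "value = parameter" convention, so this test would be a line roughly like the following (the trailing comment is only illustrative):

```
 False = mc_grouped_subproc ! test setting suggested above
```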
Thanks Olivier... maybe open another issue for the merging and we can follow that up there. Note, I merged a lot in june24; I am waiting for that. (And I know you are waiting for other things from me... sorry).
Ok I will try later this week (will be off this afternoon)
Just to follow up on the merging (on which I will open 2 PRs): this was my mistake (or bias, since I read what I "wanted to see"). So here there is no imirror (which is actually handled in auto_dsig.f), but you still have iproc=2 due to the e+ e- and mu+ mu- flavour difference in the final state (where both have identical matrix elements). Sorry for the noise here.
…if CUDACPP_RUNTIME_DISABLEFPE is set madgraph5#942 (temporary?)
This is an issue reported by Jin @choij1589 (thanks!) during yesterday's meeting with CMS https://indico.cern.ch/event/1373473/ (issue not in the slides, it was reported in the later discussion)
Details are here https://github.com/choij1589/madgraph4gpu/tree/dev_cms_integration
See in particular this commit master...choij1589:madgraph4gpu:dev_cms_integration
Description and analysis
IIUC there were some DIVBYZERO (and INVALID?) FPEs during the CMS madevent tests - note in particular that this was during CUDA runs, so they cannot come from vectorized C++ code.
My initial guess is that this comes from Fortran code, probably from auto-vectorized Fortran code. We saw something similar in #855 for rotxxx, worked around in #857 with a volatile.
It would be annoying if these issues keep popping up, because decorating the full fortran code with volatiles does not look like a scalable solution.
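The #857 workaround is in the Fortran rotxxx routine; purely as an illustration of the general technique (a C++ sketch, not the actual fix), the idea is that a volatile intermediate stops the compiler from speculating or vectorizing a guarded floating-point division in a way that raises a spurious FPE:

```cpp
// Illustrative sketch of the "volatile" trick (the real #857 fix is in Fortran, not this code).
// Reading the denominator through a volatile keeps the guard and the division exactly as written,
// preventing the optimizer from speculatively evaluating the division under aggressive -O3.
double safeRatioSketch( double num, double den )
{
  volatile double d = den;            // opaque to the optimizer: blocks unsafe speculation/vectorization
  return ( d != 0. ) ? num / d : 0.;  // the division is only evaluated when the guard really holds
}
```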
@oliviermattelaer : is there an easy way in which @choij1589 can remove -O3 from fortran and replace it with -O1 when running madgraph in his CMS environment? (just as a test to see if this makes it disappear)