
NonLocal ECP NaN with Batched Code #4941

Open · annette-lopez opened this issue Mar 15, 2024 · 15 comments

@annette-lopez commented Mar 15, 2024

**Describe the bug**
Doing a standard workflow for the spin density of neutral bulk aluminum (SCF > NSCF > Convert > J2 opt > J3 opt > DMC). The J2 optimization completes without issue, though it returns a fairly high variance/energy ratio of ~0.3. A subsequent J3 optimization fails with the following error:

Attachments: optJ3.zip, nexus_cpu.py.zip

```
QMCHamiltonian::updateComponent component NonLocalECP returns NaN.
  ParticleSet 'e' contains 12 particles :  u(6) d(6)

    u      -8.264482844      -147.1029056       -19.5066447
    u       33.76739009       27.75172402       118.3710551
    u       1.672580761       55.78669558       37.65452937
    u        39.2687636       85.68134103       20.34447848
    u       42.37720707       3.460448312      -45.80725706
    u       -72.5109913       88.43526486      -30.84771678
    d       32.46210945      -32.54069576       49.79376128
    d      0.1307777896       27.86552943       82.42433044
    d       30.18674145       44.99273039       11.11852865
    d      -49.23514243      -21.45162295       57.21033009
    d      -2.465547676      -12.37312728      -21.73156034
    d      -3.049880534      -9.460116645       90.46841502

  Distance table for dissimilar particles (A-B):
    source: ion0  target: e
    Using structure-of-arrays (SoA) data layout
    Distance computations use orthorhombic periodic cell in 3D.

  Distance table for similar particles (A-A):
    source/target: e
    Using structure-of-arrays (SoA) data layout
    Distance computations use orthorhombic periodic cell in 3D.

Unexpected exception thrown in threaded section
Fatal Error. Aborting at Unhandled Exception
```

**To Reproduce**
Run on Perlmutter with QMCPACK 3.17.9 batched code.

**Expected behavior**
The J3 optimization should complete after a total of 9 series; instead it aborts after a few initial cycles.

**System:**
Modules loaded:

```
module unload gpu/1.0
module load cpu/1.0
module load PrgEnv-gnu
module load cray-hdf5-parallel
module load cray-fftw
module unload cray-libsci
module unload darshan
module load cmake
```

Note: these runs are generated and submitted with Nexus.
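
For orientation, below is a hypothetical, heavily abbreviated Nexus driver sketching this workflow. Every machine, job, pseudopotential, and cutoff value in it is a placeholder, not the contents of the attached nexus_cpu.py:

```python
#!/usr/bin/env python3
# Hypothetical, heavily abbreviated Nexus driver for the workflow above
# (SCF > NSCF > Convert > J2 opt > J3 opt > DMC; the NSCF and DMC steps are
# omitted for brevity). Machine, job, pseudopotential, and cutoff values are
# all placeholders, not the contents of the attached nexus_cpu.py.

from nexus import settings, job, run_project
from nexus import generate_physical_system, generate_pwscf
from nexus import generate_pw2qmcpack, generate_qmcpack
from nexus import loop, linear

settings(pseudo_dir='./pseudopotentials', machine='ws16')  # placeholders

system = generate_physical_system(structure='Al.xsf', Al=3)  # 3 valence e-/Al

scf = generate_pwscf(
    identifier='scf', path='scf', job=job(cores=16),
    input_type='generic', calculation='scf',
    system=system, pseudos=['Al.upf'], ecutwfc=200)   # placeholder cutoff

conv = generate_pw2qmcpack(
    identifier='conv', path='scf', job=job(cores=1),
    dependencies=(scf, 'orbitals'))

optJ2 = generate_qmcpack(
    identifier='opt', path='optJ2', job=job(cores=16), system=system,
    jastrows=[('J1', 'bspline', 8), ('J2', 'bspline', 8)],
    calculations=[loop(max=6, qmc=linear(
        minmethod='OneShiftOnly', minwalkers=0.3, samples=25600))],
    dependencies=(conv, 'orbitals'))

# J3 optimization restarting from the optimized J1+J2.
optJ3 = generate_qmcpack(
    identifier='opt', path='optJ3', job=job(cores=16), system=system,
    jastrows=[('J3', 'polynomial', 3, 3, 5.0)],       # placeholder J3 spec
    calculations=[loop(max=3, qmc=linear(
        minmethod='OneShiftOnly', minwalkers=0.5, samples=51200))],
    dependencies=[(conv, 'orbitals'), (optJ2, 'jastrow')])

run_project()
```
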
@prckent added the bug label on Mar 15, 2024

@prckent (Contributor) commented Mar 15, 2024

This looks like a bug in the J3. It seems unlikely that any numerical or statistical issue would be to blame, since the system is so small. Usefully, this is a pure-MPI CPU run, so we can rule out anything exotic on the computational side. It is puzzling that this hasn't shown up for anyone else. It is interesting that some of the electrons have wandered a long way relative to the primitive cell dimensions. This shouldn't matter, but perhaps it does...

@prckent (Contributor) commented Mar 15, 2024

Can you please put the wavefunction file somewhere accessible, or give a pointer to your Perlmutter directories with the permissions set appropriately?

@annette-lopez (Author)

The directory has been shared on Perlmutter here: /global/cfs/cdirs/m2113/al_J3

@ani-adavi

J3_issue.zip

I am running the code on Polaris (QMCPACK 3.17.9 under /soft/applications/qmcpack/develop-20240118/) with legacy drivers, in a CPU-only complex build, and I also encounter a NaN error during J3 optimization with a similar workflow. The code seems to run without error when I reduce minwalkers in the first few cycles to 0.01, but this results in large jumps in energy and variance. Please let me know if more information is needed.
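
For reference, minwalkers is a real parameter of QMCPACK's linear optimizer: the minimum acceptable fraction of effective walkers after reweighting, below which a proposed parameter step is rejected. A minimal Nexus-style sketch of the workaround described above, with placeholder values:

```python
# Hypothetical Nexus-style sketch of the workaround: loosen minwalkers for
# the first cycles, then tighten it. All values are placeholders.
from nexus import loop, linear

calculations = [
    loop(max=2, qmc=linear(minmethod='OneShiftOnly',
                           minwalkers=0.01,    # loose: accept aggressive steps
                           samples=25600)),
    loop(max=4, qmc=linear(minmethod='OneShiftOnly',
                           minwalkers=0.5,     # tight: reject steps that collapse
                           samples=51200)),    # the effective walker population
]
```

The trade-off reported above follows directly: a small minwalkers lets the optimizer accept steps under which the reweighted walker population collapses, which is exactly when the energy and variance can jump.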

@prckent (Contributor) commented Mar 24, 2024

Thanks for the report. We have also heard that turning down the meshfactor can trigger the problem in J2 for the original system. It could also be #4917 or something like it.

@prckent (Contributor) commented Apr 24, 2024

This has been sitting around for a month, so I wanted to update the status. I have been experimenting with Ilkka's single-atom version; while it appears to show the same problem, it could have its own issues due to being so small:

  • The problem with bulk Al is straightforward to reproduce within minutes on a CPU system.
  • The problem is not related to the plane wave cutoff/spline grids, since these are well converged.
  • A seemingly good wavefunction is produced by D+J1+J2 optimization.
  • However, the OneShift optimizer immediately takes a crazy step (a very large change in coefficients, ~10^9) when J3 is added, which subsequently produces a NaN during pseudopotential evaluation. The abort itself is therefore correct, not a bug; the problem lies with the optimizer or the wavefunction (see the toy sketch at the end of this comment).
  • This applies even when large numbers of samples are used for optimization; the optimizer still tries to make the bad step.

It is worth noting that J3 is not expected to do very much here, but it still shouldn't go wrong like this. Conservative settings (e.g. increasing minwalkers) seem only to delay the problem.

It has been reported that using different optimizers can avoid the problem, but since they aren't necessarily optimizing the same objective function, they may be bypassing the problem rather than being immune to it.

My suspicions are that:

  • J3 may somehow have a bug for this case. How other people have been able to use J3 successfully is a puzzle that would presumably be answered by identifying the bug.
  • OneShift needs a better default or more conservative handling for this case, for reasons that have yet to be determined.

Will try some larger cells now.
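
Editorial note on the mechanics of the "crazy step": the linear method underlying OneShiftOnly solves the generalized eigenvalue problem H c = λ S c in the basis of the wavefunction and its parameter derivatives, and reads the update off the chosen eigenvector as Δp_i = c_i / c_0. The toy numpy sketch below (an illustration, not QMCPACK code) shows how a nearly singular overlap matrix, e.g. from redundant J3 parameters, produces an astronomically large candidate step of the kind described above:

```python
import numpy as np
from scipy.linalg import eig

# Toy linear-method matrices in the basis {Psi, dPsi/dp_1, ..., dPsi/dp_n}.
# Illustration only, not QMCPACK code.
rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n))
A[-1, :] = A[-2, :]          # duplicate derivative: redundant parameter
S = A @ A.T                  # overlap matrix, now exactly rank-deficient
S += 1e-12 * np.eye(n)       # "statistical noise" lifts the exact singularity
H = rng.normal(size=(n, n))  # sampled Hamiltonian matrix (nonsymmetric in VMC)

vals, vecs = eig(H, S)       # candidate updates solve H c = lambda S c
# Proposed parameter change for each candidate, normalized to the Psi component.
steps = np.abs(vecs[1:, :].real).max(axis=0) / np.abs(vecs[0, :].real)
j = np.argmax(steps)
print(f"worst candidate: lambda = {vals[j].real:+.2e}, max|dp| = {steps[j]:.2e}")
# The near-null direction of S has essentially no Psi component, so c[0] ~ 0
# and dp = c[1:]/c[0] explodes, comparable to the ~10^9 change reported above.
```

If the shift and the eigenvector-selection heuristics fail to reject such a candidate, the optimizer takes the step and the subsequent pseudopotential evaluation NaNs, consistent with the abort in this issue.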

@annette-lopez (Author) commented Apr 24, 2024 via email

@ye-luo (Contributor) commented Apr 24, 2024

@prckent For Ilkka's reproducer, is there a GitHub issue, or where can I get it?

@prckent (Contributor) commented Apr 24, 2024

Ilkka's reproducer is a modified version of Annette's. You'll need a working Python ASE installation.
ilkka.tar.gz

It is worth considering whether the 2 up / 1 down electron case is properly handled in J3.
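
For anyone rebuilding a similar single-atom reproducer from scratch, here is a hypothetical ASE fragment; the actual contents of ilkka.tar.gz are not reproduced here, and the cell size and filename are placeholders:

```python
# Hypothetical ASE fragment for a single-Al-atom cell; the real reproducer
# is in ilkka.tar.gz and may differ. Cell size and filename are placeholders.
from ase import Atoms
from ase.io import write

al = Atoms('Al', positions=[(0.0, 0.0, 0.0)],
           cell=[8.0, 8.0, 8.0], pbc=True)   # placeholder 8 Angstrom cubic cell
write('Al.xsf', al)
# With a 3-valence-electron ECP, a single Al atom gives exactly the
# 2 up / 1 down spin case mentioned above; the net spin is specified in the
# QMC input, not in the structure file.
```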

@markdewing (Contributor)

Have you looked at the eigenvalue chosen by the mapping step after the eigenvalue solve? I don't think it gets printed out currently, but it probably should be.
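
Building on the toy example above (again an illustration, not the actual QMCPACK mapping code), a selection step that skips candidates with a negligible Psi component and prints the chosen eigenvalue, as suggested, could look like this:

```python
import numpy as np
from scipy.linalg import eig

def select_update(H, S, min_psi_overlap=1e-3):
    """Toy post-solve mapping step (illustration only; the criterion and
    threshold are assumptions, not QMCPACK's actual rule): among the
    eigenvectors of H c = lambda S c, discard candidates whose Psi component
    is negligible, pick the lowest remaining real eigenvalue, and print it."""
    vals, vecs = eig(H, S)
    ok = np.flatnonzero(np.abs(vecs[0, :].real) > min_psi_overlap)
    i = ok[np.argmin(vals.real[ok])]
    print(f"selected eigenvalue: {vals[i].real:+.6e}")  # the diagnostic asked for
    c = vecs[:, i].real
    return c[1:] / c[0]   # parameter update relative to the Psi component

# Example with random, well-conditioned toy matrices:
rng = np.random.default_rng(1)
n = 6
A = rng.normal(size=(n, n))
dp = select_update(rng.normal(size=(n, n)), A @ A.T)
```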

@prckent (Contributor) commented Apr 24, 2024

@markdewing could this be #4917?

@markdewing (Contributor)

Yes, it could be. The extremely large step is one of the symptoms.

@annette-lopez (Author)

Update: I still see the issue with the latest QMCPACK on NM Al. However, at Gani's suggestion, I tried the quartic optimizer, and with it I no longer see the issue.

@jtkrogel (Contributor) commented Aug 7, 2024

Added a new label for ongoing issues with the batched code.

@prckent added this to the v4.0.0 Release milestone on Aug 21, 2024

@prckent (Contributor) commented Aug 21, 2024

Tagged this for v4. I think we need at least an understanding of what causes this, if not a fix; i.e., it is OK to postpone only if there is a workaround and we have sufficiently shown that the problem is not an underlying bug but rather an algorithmic limitation.
