CP2K jobs slower with higher number of cores per worker #27
Comments
Hi @b-mazur,

The energy of the isolated atoms depends on the specific pseudopotential used as well as on the functional, so tabulating those would be a considerable amount of work. In normal scenarios these atomic energy calculations finish quite quickly (a few seconds to a few minutes), so it's usually easier to do them on the fly.

This is a bug that I've encountered once on a very specific cluster here in Belgium, and I haven't quite figured out what causes it. Heuristically, I've found that adding or removing a few MPI flags brings CP2K performance back to normal, but I don't quite understand why that is the case given that everything is executed within a container. What are the host OS and the host container runtime (singularity/apptainer version)? Did you modify the default MPI command in the .yaml?
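For reference, the relevant settings would typically sit in the ReferenceEvaluation section of the execution .yaml; a rough sketch with placeholder values is below (the mpi_command key and the flags shown are illustrative only and may differ between psiflow versions):

```yaml
# Sketch of the relevant section of the execution .yaml (placeholder values).
# The mpi_command entry is illustrative, not the exact default.
ReferenceEvaluation:
  cores_per_worker: 4            # MPI ranks used for each CP2K singlepoint
  max_walltime: 20               # minutes before a singlepoint is cancelled
  mpi_command: 'mpirun -np 4'    # the line to experiment with when adding/removing MPI flags
```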
It's one of the things we're fixing in the next release.
Hi @svandenhaute, apologies for the long silence. I'm still facing this problem. I've tried to calculate a single point with the cp2k container (oras://ghcr.io/molmod/cp2k:2023.2) and the calculation finished in ~1 min (so the good news is that you fixed it in the new release). I've already tried different options for the default MPI command in the .yaml. My host OS is
and I was also thinking of moving to psiflow 3.0.4, since these problems do not occur there, but I am currently interested in incremental learning to create an MLP for the phase transition in a MOF. I see that
I was also wondering about using the new CP2K container in psiflow 3.0.3. Is this possible?
I honestly don't know. If you have tried both
Yes, we are actually in the final stages here. The tentative timeline is to create a new release (including working examples of the incremental learning scheme) by this Sunday.
No, they are not compatible. The new CP2K container is built with OpenMPI instead of MPICH, and it also does not contain psiflow or its dependencies (which are required for compatibility with 3.x). If possible, I'd strongly suggest waiting until the new release is out. Aside from this, it should fix a bunch of other issues!
Great to hear! I'll wait for the next release then. Thanks a lot for your help and quick reply.
Discussed in #26
Originally posted by b-mazur May 17, 2024
I'm trying to reproduce the mof_phase_transition.py example and I'm facing an issue where, with an increasing number of cores per worker, my calculations get prohibitively slow. In all cases, `max_walltime: 20` results in `AssertionError: atomic energy calculation of O failed` because none of the CP2K tasks for oxygen are completed in 20 minutes. I played a bit with different numbers of cores per worker; here are the numbers of SCF steps reached in 20 minutes for the oxygen task with multiplicity 5:
Finally, I was able to finish this part by increasing max_walltime to 180 minutes and using only 1 core per worker, but this will create another issue when ReferenceEvaluation is used for the whole MOF in the next steps.
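In config terms, the workaround boils down to roughly the following (a sketch, not my exact .yaml):

```yaml
# Workaround sketch: long walltime, single core per CP2K worker.
ReferenceEvaluation:
  cores_per_worker: 1     # a single core per worker avoids the slowdown
  max_walltime: 180       # minutes; enough for the isolated-atom singlepoints to finish
```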
I've never used CP2K before, but I feel that 180 minutes is far too long for a single-point calculation of a single atom. What I also observe is surprisingly low CPU utilization of the Slurm tasks, at levels below 10%. I checked the timings in the CP2K output, but the MPI timing doesn't seem to be that large (however, as I said, I have no experience, so maybe I'm misunderstanding something). Here is an example:
I'm using psiflow 3.0.4 and the container oras://ghcr.io/molmod/psiflow:3.0.4_python3.10_cuda. Any idea what I could check to find where the problem is? Also, wouldn't it be better to tabulate the energies of all atoms in the psiflow source files? Thanks in advance for any help!
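I imagine such a table would have to be keyed by element, functional, and pseudopotential, something along these lines (a purely hypothetical layout with placeholder values, just to illustrate):

```yaml
# Hypothetical layout for a tabulated atomic-energy file; values are placeholders.
atomic_energies:
  O:
    PBE:
      GTH-PBE-q6: <energy in hartree>
  H:
    PBE:
      GTH-PBE-q1: <energy in hartree>
```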