GFN-FF timing information, optimization for MD #1195

Open
venturellac opened this issue Feb 12, 2025 · 7 comments

venturellac commented Feb 12, 2025

I am running ONIOM molecular dynamics with GFN-FF as the low level and GFN2-xTB as the high level in ALPB solvent. The total system is ~5000 atoms and the SQM region is ~60 atoms, running on 24 CPU cores. The timings reported in the xtb log for each energy+gradient evaluation are quite fast (0.003 seconds for GFN-FF, 0.4 seconds for GFN2-xTB). In practice, however, each MD step takes ~5 seconds and, from my testing, seems to be dominated by the GFN-FF energy+gradient time. According to the GFN-FF publication, the expected time per step for a similarly sized system should be around 1 second, and I am using more resources than that benchmark. I would be very happy with 0.5-1 seconds per step on these resources. Are there any further optimizations or tricks I can pursue to speed up my MD simulations, or at least to establish an accurate performance baseline? Any help would be greatly appreciated!

marcelmbn added the support (Question regarding this project or underlying method) and method: GFN-FF (Related to the GFN-FF method) labels on Feb 13, 2025
foxtran (Contributor) commented Feb 13, 2025

I get a CPU efficiency of about 80% on 24 CPU cores for pure GFN-FF on an 800-atom system. A quick-and-dirty way to check it is to run your MD for a couple of minutes under the time command and get something like:

real	0m50.756s
user	16m47.485s
sys	0m1.725s

Then divide the user time by the real time, and divide the result by the number of CPUs. For me that is (16*60+47)/50/24 = 0.839. Not bad, but it could be better.
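
For example, the whole check could look like this (a minimal sketch; the input file name below is a placeholder, not taken from this thread):

# run a short MD under the time command
time xtb struct.xyz --gfnff --md > md.log

# CPU efficiency = user / (real * number of cores); with the numbers above:
echo "scale=3; (16*60+47)/50/24" | bc    # -> .839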

Could you please do the same trick for your calc?

venturellac (Author)

Thank you for your reply - this is a helpful start. My system reports high CPU utilization, similar to yours (~80%), but the time to evaluate GFN-FF on the whole system (the bottleneck) is not accelerated much by adding cores. The runtime is essentially the same whether I use 4 cores or 24 cores (around 7 seconds). In other words, GFN-FF on the whole system appears to be using the cores but somehow not benefiting from them.

Could you please share some details about the environment variables and xtb build instructions that I can pass on to my computing center specialists? Here is an example job I am running as a benchmark (xtb version 6.6.1), where ncores is 24 or 4:

ncores=24

export OMP_STACKSIZE=20G
export OMP_NUM_THREADS=$ncores
export MKL_NUM_THREADS=$ncores
export OMP_MAX_ACTIVE_LEVELS=1

xtb my_6000_atoms.pdb --oniom gfn2:gfnff my_inner_region_indices --alpb water --verbose

foxtran (Contributor) commented Feb 13, 2025

It looks like xtb spends almost all of its time on data sharing/context switching between threads, especially for larger numbers of threads.

For 4 cores, I have:

real	1m17.868s
user	4m10.939s
sys	0m2.928s

Again, with 80% CPU efficiency.

So you can try to find an optimal number of threads. I always use 4 threads and submit more tasks instead to occupy the whole node.
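
For example (a rough sketch; the directory layout and input names are placeholders), six independent 4-thread jobs filling a 24-core node could look like:

export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
for i in 1 2 3 4 5 6; do
    ( cd job_$i && xtb input.xyz --gfnff --md > md.log 2>&1 ) &
done
wait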

venturellac (Author)

OK, this is good to know. I guess I should use 4 or fewer cores for my workload. If it helps for benchmarking/knowledge purposes, here are some detailed timings for the GFN-FF energy/gradient evaluation:

Component                  1 core    4 cores   8 cores   24 cores
E+G (total)                10.915s   7.570s    8.641s    7.323s
Distance/D3 list           0.105s    0.104s    0.106s    0.104s
Non-bonded repulsion       0.146s    0.047s    0.026s    0.010s
dCN                        0.396s    0.384s    0.446s    0.392s
EEQ energy and q           2.500s    1.035s    1.368s    1.112s
D3                         2.213s    1.621s    2.185s    1.561s
EEQ gradient               1.175s    0.916s    0.867s    0.854s
Bonds                      0.355s    0.261s    0.327s    0.278s
Bend and torsion           0.007s    0.002s    0.001s    0.002s
Bonded ATM                 0.008s    0.002s    0.001s    0.001s
HB/XB (incl. list setup)   1.405s    1.000s    1.180s    0.895s
GBSA                       3.431s    2.976s    2.925s    2.916s
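
(For scale: the total E+G time improves only from 10.915s on 1 core to 7.323s on 24 cores, a speedup of roughly 1.5x. The non-bonded repulsion scales almost ideally, 0.146s to 0.010s, but is a tiny fraction of the total, while GBSA barely changes at all, 3.431s to 2.916s.)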

If I have time I can look into the implementation, but is there any particular reason why specific terms in the Hamiltonian benefit from threading while others do not?

foxtran (Contributor) commented Feb 13, 2025

I did not update the D3 part in #1178, so it will still have poor parallelization. It looks like the EEQ energy and q are not well parallelized either (see the 8-core result), and HB/XB is also not well parallelized. I'd say that GBSA is not parallelized at all (and it takes up too much of the time for you).

venturellac (Author) commented Feb 14, 2025

Good to know. I also notice that the SHAKE algorithm dominates the memory usage: shake=2 (all bonds) segfaults with 32 GB of memory, while shake=1 or 0 seem fine, even with a modest 16 GB. Is this what you would expect? In your opinion, how challenging would it be to parallelize the bottleneck terms (D3, GBSA, and EEQ)? I may try to do so myself.
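
For context, shake here refers to the setting in the $md block of the xtb control file (0 = off, 1 = constrain only bonds involving hydrogen, 2 = constrain all bonds); a minimal block, with the other values purely illustrative, looks like:

$md
   temp=300.0
   time=50.0
   step=2.0
   shake=2
$end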

marvinfriede (Member)

For D3 and EEQ, we have library implementations with much better code quality. Correspondingly, I am fairly certain that the parallelization is also much better. In the long term, we plan to replace the separate implementations in xtb with these libraries, but this will obviously take some time.
