Fix reinitialization issues when using total forces in NAMD #887
Conversation
Force-pushed 5724424 to d54b6e1
Force-pushed d54b6e1 to ce0385b
While commit 8cc8ef9 seems to fix the problem, in local testing I still get a crash from
Commit 8cc8ef9 should have solved the crash in #891. As I have said, while the crash no longer appears on GitHub CI, I still get another crash in my local test with the following change:
diff --git a/src/GlobalMasterServer.C b/src/GlobalMasterServer.C
index 8988998b..a525f73a 100644
--- a/src/GlobalMasterServer.C
+++ b/src/GlobalMasterServer.C
@@ -468,6 +468,9 @@ int GlobalMasterServer::callClients() {
if (positions[i].atomID == rqAtoms[j]){
clientAtomPositions.add(receivedAtomPositions[positions[i].index]);
clientAtomIDs.add(rqAtoms[j]);
+ if (master->requestedTotalForces()) {
+ clientReceivedForceIDs.add(rqAtoms[j]);
+ }
--i; // rqAtoms may contain repeats
if ( ++j == num_atoms_requested ) break;
}
@@ -498,7 +501,7 @@ int GlobalMasterServer::callClients() {
gtf_i,gtf_i+(numForceSenders?master->old_num_groups_requested:0),
goi_i, goi_e, gov_i, gov_e,
forced_atoms_i,forced_atoms_e,forces_i,
- receivedForceIDs.begin(),receivedForceIDs.end(),receivedTotalForces.begin());
+ mf_i, mf_e, receivedTotalForces.begin());
NAMD_EVENT_STOP(1, NamdProfileEvent::GM_PROC_DATA);
a_i = receivedAtomIDs.begin();
p_i = receivedAtomPositions.begin();

But then (i) the list of total-force atom IDs is sorted the same way as the list of normal atom IDs requesting positions, which changes the numerical results of the total forces and fails the tests that require total forces, and (ii) I don't know whether Colvars needs to "introspect" the forces coming from TCL forces. In other words, I don't know if this is what Colvars wants.
while ((i < this->size()) && (stream >> (*this)[i])) {

@giacomofiorin Could you look into the issues above?
I have worked out a NAMD patch for the second issue that sorts the total-forces array correctly and therefore does not have the incorrect numerical results:

diff --git a/src/GlobalMasterServer.C b/src/GlobalMasterServer.C
index 8988998b..d04b6a5e 100644
--- a/src/GlobalMasterServer.C
+++ b/src/GlobalMasterServer.C
@@ -442,6 +442,30 @@ int GlobalMasterServer::callClients() {
}
sort(positions.begin(), positions.end(), atomID_less());
+ // Same as above, we need a vector to sort the atom IDs requiring total forces
+ vector <position_index> total_forces;
+ // Check if any of the clients requires total forces
+ bool total_forces_requested = false;
+ while (m_i != m_e) {
+ GlobalMaster *master = *m_i;
+ if (master->requestedTotalForces()) {
+ total_forces_requested = true;
+ break;
+ }
+ m_i++;
+ }
+ if (total_forces_requested) {
+ for (int j = 0; f_i != f_e; ++f_i, ++j) {
+ position_index temp;
+ temp.atomID = *f_i;
+ temp.index = j;
+ total_forces.push_back(temp);
+ }
+ sort(total_forces.begin(), total_forces.end(), atomID_less());
+ }
+
+ // Reset the client pointer;
+ m_i = clientList.begin();
/* call each of the masters with the coordinates */
while(m_i != m_e) {
int num_atoms_requested, num_groups_requested, num_gridobjs_requested;
@@ -474,6 +498,36 @@ int GlobalMasterServer::callClients() {
}
if ( j != num_atoms_requested ) NAMD_bug(
"GlobalMasterServer::callClients() did not find all requested atoms");
+ if (master->requestedTotalForces()) {
+ int k = 0;
+ // FIXME: There's no reliable way to check if this is really the first timestep
+ // At the first time step, the total forces may not be ready
+ const bool total_forces_ready = receivedTotalForces.size() == rqAtoms.size();
+ for (int i = 0; i < total_forces.size(); i++) {
+ if (total_forces[i].atomID == rqAtoms[k]){
+ clientReceivedForceIDs.add(rqAtoms[k]);
+ if (total_forces_ready) {
+ clientReceivedTotalForces.add(receivedTotalForces[total_forces[i].index]);
+ }
+ --i; // rqAtoms may contain repeats
+ if ( ++k == num_atoms_requested ) break;
+ }
+ }
+ // FIXME: For Colvars, it is possible to do
+ // cv configfile xxx.colvars
+ // run 20
+ // structure yyy.pdb
+ // cv reset
+ // cv configfile yyy.colvars
+ // run 20
+ // The problem arises when "yyy.colvars" and "xxx.colvars" select different sets of
+ // atoms for total forces. There is no good way to check if this is the first step and
+ // total forces should be skipped here (but Colvars would check it anyway and discard
+ // the invalid result).
+ // if ( total_forces_ready && k != num_atoms_requested ) {
+ // NAMD_bug("GlobalMasterServer::callClients() did not find all requested total forces");
+ // }
+ }
}
AtomIDList::iterator ma_i = clientAtomIDs.begin();
@@ -498,7 +552,7 @@ int GlobalMasterServer::callClients() {
gtf_i,gtf_i+(numForceSenders?master->old_num_groups_requested:0),
goi_i, goi_e, gov_i, gov_e,
forced_atoms_i,forced_atoms_e,forces_i,
- receivedForceIDs.begin(),receivedForceIDs.end(),receivedTotalForces.begin());
+ mf_i, mf_e, mtf_i);
NAMD_EVENT_STOP(1, NamdProfileEvent::GM_PROC_DATA);
a_i = receivedAtomIDs.begin();
p_i = receivedAtomPositions.begin();

But I don't know if Colvars really needs it.
I would like to have a base-class function for the atoms map, because it's not a NAMD-specific issue. LAMMPS and GROMACS are not using it yet, but simply because they lack this optimization.
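As a rough illustration of what such a shared base-class accessor could look like (the class and method names below are hypothetical, not the actual Colvars API), the map only needs to know the backend's current atom IDs:

```cpp
// Illustrative only: a backend-agnostic atom-ID -> local-index map that the
// NAMD, LAMMPS or GROMACS proxies could all reuse. Names are hypothetical.
#include <unordered_map>
#include <vector>

class colvarproxy_atom_map_sketch {
public:
  // Rebuild the map whenever the backend changes the set of requested atoms
  void rebuild_atoms_map(const std::vector<int> &atom_ids) {
    atoms_map_.clear();
    for (size_t i = 0; i < atom_ids.size(); ++i) {
      atoms_map_[atom_ids[i]] = static_cast<int>(i);
    }
  }

  // Local index of a global atom ID, or -1 if the atom was never requested
  int local_index(int atom_id) const {
    const auto it = atoms_map_.find(atom_id);
    return (it == atoms_map_.end()) ? -1 : it->second;
  }

private:
  std::unordered_map<int, int> atoms_map_;
};
```

Each backend would then only be responsible for calling the rebuild step whenever its atom selection changes.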
Ask the other folks at UIUC (including Jim), and you will find no fans of GlobalMaster's multiplexed design. But the alternative would have been multiple rounds of reduction/broadcast, and I think at the time it was not only more coding work but also potentially harmful to multinode performance.
I don't think we know what each person using Colvars really wants, but I agree that that's a very unlikely use case.
Well, you did that just hours after
Editing the PR's title and description for merging. Will close #891
I didn't use
I have tried to solve the issue and have revised the NAMD patch that I posted here; see https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486. Could you take a look at that NAMD MR to see if Colvars really needs it to work correctly?
It's not just because of the multiple GlobalMaster objects, but also because the messages received by GlobalMasterServer are not in order, either by task ID or by atom. Typical solutions to this would be sorting the atom buffer to match the order that Colvars expects, or using a lookup table (a sketch of the latter is below). I am less familiar with the CUDAGM code in the NAMD repo than with the interface code that you have added here in the Colvars repo, but it seems that you are effectively sorting the atomic coordinates as you gather them?
I should also have been clearer on the other backends:
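For what it's worth, here is a minimal sketch of the lookup-table option mentioned above (all names are illustrative, not NAMD or Colvars code): index the receive buffer by atom ID once, then gather in the order the client originally requested.

```cpp
// Illustrative sketch: map each received atom ID to its slot in the receive
// buffer, then gather positions in the client's request order.
#include <unordered_map>
#include <vector>

struct Vec3 { double x, y, z; };

std::vector<Vec3> gather_in_request_order(const std::vector<int> &received_ids,
                                          const std::vector<Vec3> &received_pos,
                                          const std::vector<int> &requested_ids) {
  // atom ID -> index in the (unordered) receive buffer
  std::unordered_map<int, size_t> where;
  for (size_t i = 0; i < received_ids.size(); ++i) {
    where[received_ids[i]] = i;
  }
  std::vector<Vec3> ordered;
  ordered.reserve(requested_ids.size());
  for (const int id : requested_ids) {
    const auto it = where.find(id);
    if (it != where.end()) {
      ordered.push_back(received_pos[it->second]);
    }
  }
  return ordered;
}
```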
Doing just that. I think it mostly depends on what the other NAMD folks will say. But if we can assume that the only atoms and forces that GlobalMasterColvars sees are the ones it requested, that would be an improvement.
Given this open MR, should we maybe wait a few days to merge this? Or merge it now (after including patches here to reproduce the NAMD changes that you're already submitting in that MR)?
This PR currently doesn't completely solve the crash issue. Even after this one and https://gitlab.com/tcbgUIUC/namd/-/merge_requests/486 get merged, I still have no idea how to re-initialize the
You mean that there currently is no callback to Colvars to state that a new
(I can currently run that test successfully on my local machine even without the changes in your NAMD MR...)
I'm not sure. Maybe you can make any one of
It does run with the default compilation flags with optimizations, but doesn't work if you use |
Yeah, that's definitely required; I'll do that. The flip side of it is catching the use case where the system's topology changes while Colvars is still defined. That may need a change in NAMD.
Force-pushed 8cc8ef9 to 1cdfe8d
The fix looks good! I wish I had seen the bug... Do you think a variant of the
I have no earthly idea! I couldn't see the bug locally either, I just took @HanatoK's word for it... |
Let me try to push a commit to force the boundary check and call |
Hi @jhenin and @giacomofiorin! Could you try bc27da1 locally? The ARM64 build currently skips the
Indeed! This does reveal the crash, and therefore the fix. Now I wish I had an ARM64 test box to investigate - do you?
I have one, but the issue now is that on ARM64, the total force is
@jhenin I think this is done now. Take a look?
HanatoK left a comment
Looks good to me overall. Can you take a look at the GPU code path in ddd8277 and provide a way to reset the protected ft field there?
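For context, here is a minimal sketch of the kind of accessor being requested, assuming only that colvar keeps a protected total-force member ft (the accessor name and the member's type below are illustrative, not the actual Colvars code):

```cpp
// Illustrative sketch: a public way to clear the protected total-force member,
// so that the GPU code path can reset it without reaching into internals.
// Only the existence of a protected ft comes from the PR discussion.
#include <vector>

class colvar_sketch {
public:
  // Zero out the stored total force (schematic reset)
  void reset_total_force() { ft.assign(ft.size(), 0.0); }

protected:
  std::vector<double> ft;  // total force acting on this colvar (schematic type)
};
```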
I have made a NAMD MR to pass the step information to the CUDAGM clients (https://gitlab.com/tcbgUIUC/namd/-/merge_requests/492). The corresponding Colvars PR in #899 could use the updated interface to invalidate the total forces for startup steps.
I saw that commit, and thinking about it, the only way I see is to separate out what I just added here (Lines 1647 to 1655 in ddd8277) into a dedicated function, and call it there. More generally, I see this as a side effect of the code duplication that we're starting to have (but we should try to reduce it).
To be clear, I'll do that right now
Could we add the check for valid total forces here instead? (colvars/src/colvar_gpu_calc.cpp, Lines 450 to 452 in ddd8277)
Also, what are your thoughts about replacing
I'm not quite sure. You can remove the
Yes, that part is clear: you are replicating in the GPU-friendly code path the test for whether the total forces are available from the same step (feature flag: `f_cv_total_force_current_step`) or only from the previous step. The latter is what we have currently (certainly so in NAMD), so the check for the "valid" flag is what we would want there (sketched below). The condition
Also, to remind: can we maybe drop support for the
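A schematic of that distinction, assuming a boolean that mirrors the `f_cv_total_force_current_step` feature flag; the struct and the step bookkeeping are illustrative, not actual Colvars code:

```cpp
// Schematic of the check described above. Only the flag name is taken from
// the discussion; the struct, members and step bookkeeping are illustrative.
struct total_force_availability_sketch {
  bool total_force_current_step = false;  // mirrors f_cv_total_force_current_step
  bool total_forces_valid = false;        // set once the proxy has real forces

  // May the total force reported at 'step' be used?
  bool usable(long step, long first_step) const {
    if (!total_forces_valid) return false;
    // With the lagged scheme (forces from the previous step), the first step
    // after (re)initialization has nothing valid to report yet.
    return total_force_current_step || (step > first_step);
  }
};
```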
Remove the check (see Colvars#887 (comment))
Thanks. I have removed the check in cc92c96 |
src/colvar.cpp (Outdated)
error_code |= calc_cvcs();
if (!cvm::main()->proxy->total_forces_valid()) {
  // Zero out the colvar total force when atomic total forces are not available
  ft.reset();
My impression is that the GPU code doesn't use colvar::calc or colvar::calc_cvcs. The entry point of the GPU code is in colvarmodule.cpp:
Lines 1091 to 1093 in 9f72f0d

#if defined (COLVARS_CUDA) || defined (COLVARS_HIP)
    error_code |= gpu_calc->calc_cvs(*variables_active(), this);
#else
The idea of the GPU code there is to flatten the whole tree-like structure in Colvars, construct a single CUDA graph from the sub-graphs of all atom groups, and then compute all CVCs and their total forces. In other words, the serial code traverses the colvardeps* tree in post-order, while the GPU code traverses it in something similar to level-order (illustrated below).
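For readers less familiar with the two orders, here is a generic sketch on a plain dependency tree (deliberately not the real colvardeps class): post-order finishes all children before their parent, while level-order visits the tree breadth-first, which maps more naturally onto batched GPU work.

```cpp
// Generic illustration of the two traversal orders (not actual Colvars code).
#include <deque>
#include <vector>

struct DepNode {
  std::vector<DepNode *> children;
  void compute() { /* per-node work, e.g. one CVC or atom group */ }
};

// CPU/serial style: post-order, every child finishes before its parent.
void calc_post_order(DepNode *node) {
  for (DepNode *child : node->children) calc_post_order(child);
  node->compute();
}

// GPU style: level-order (breadth-first), so whole levels of the flattened
// tree can be enqueued together, e.g. as sub-graphs of one larger CUDA graph.
void calc_level_order(DepNode *root) {
  std::deque<DepNode *> queue;
  queue.push_back(root);
  while (!queue.empty()) {
    DepNode *node = queue.front();
    queue.pop_front();
    node->compute();
    for (DepNode *child : node->children) queue.push_back(child);
  }
}
```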
Yes, it totally makes sense to reduce the GPU kernel load delay!
Somehow I thought that the host portion of that code was calling colvar::calc(), which it isn't. I added an accessor in the latest commit
Force-pushed a665e80 to 8e7eb2d
@giacomofiorin I realize that simply resetting the
Lines 1674 to 1698 in 9f72f0d
The function
HanatoK left a comment
Looks good to me with the latest commit!
I think you are right: the CPU code path was using
This makes me realize that setting the total force to zero is probably done most cleanly at the collection step (a rough sketch below)... Small refactor coming in
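A rough sketch of what zeroing at the collection step could look like (the function and variable names are hypothetical): the collected value starts at zero and atomic contributions are only accumulated when the proxy reports valid total forces.

```cpp
// Hypothetical sketch of zeroing at collection time: the accumulated total
// force is zero by construction, and atomic contributions are only added
// when the proxy reports valid total forces (e.g. not right after a reset).
#include <vector>

double collect_total_force_sketch(bool total_forces_valid,
                                  const std::vector<double> &atomic_total_forces) {
  double ft = 0.0;                    // collected value starts at zero
  if (!total_forces_valid) {
    return ft;                        // nothing valid to accumulate yet
  }
  for (const double f : atomic_total_forces) {
    ft += f;                          // schematic accumulation/projection
  }
  return ft;
}
```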
…force is zero after collection
(2025-12-09: rewritten description - @giacomofiorin)
Fixes two bugs triggered by the (relatively infrequent) combination of features when total forces are being used in a simulation where the system's topology is changed mid-run:
- a crash when Colvars is re-initialized mid-run while total forces are in use (see #891)
- `nan` in the output, due to a division by zero (because not all Colvars properties had been set previously)

Fixes #778