Implement Saturated Ring Sampling Conformation Feature #34
easydock/vina_dock.py
Outdated
sys.stderr.write('STDERR output:\n')
sys.stderr.write(e.stderr + '\n')
sys.stderr.flush()
output = None
This line is no longer necessary and can be deleted. If docking was unsuccessful, it will not contribute to output_list.
easydock/preparation_for_docking.py
Outdated
for i in atomIds:
    d = conf1.GetAtomPosition(i).Distance(conf2.GetAtomPosition(i))
    ssr += d * d
ssr /= mol.GetNumAtoms()
ssr /= len(atomIds)
easydock/preparation_for_docking.py
Outdated
for i in atomIds:
    d = conf1.GetAtomPosition(i).Distance(conf2.GetAtomPosition(i))
    ssr += d * d
ssr /= mol.GetNumAtoms()
I would add an if statement to split the code execution and preserve the previous behavior:
if atomIds:
    # new code
else:
    # old code
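The two suggestions above (divide by the number of atoms actually compared, and fall back to the previous behavior when no atom subset is given) could be sketched as follows. This is a minimal standalone sketch, not the PR's code: `subset_rmsd` is a hypothetical helper, and plain coordinate tuples stand in for RDKit conformer positions.

```python
import math


def subset_rmsd(coords1, coords2, atom_ids=None):
    """RMSD between two conformers given as lists of (x, y, z) tuples.

    atom_ids: optional atom indices to compare; when empty or None,
    all atoms are used (the previous behavior).
    """
    if atom_ids:
        ids = list(atom_ids)
    else:
        ids = list(range(len(coords1)))
    ssr = 0.0
    for i in ids:
        ssr += math.dist(coords1[i], coords2[i]) ** 2
    # Divide by the number of atoms compared, not by the size of the molecule
    return math.sqrt(ssr / len(ids))


conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 2.0, 0.0), (3.0, 2.0, 0.0)]
ring_only = subset_rmsd(conf_a, conf_b, atom_ids=[2, 3])
all_atoms = subset_rmsd(conf_a, conf_b)
```

In this toy case only atoms 2 and 3 move, so dividing by all four atoms would dilute the deviation compared with dividing by the two atoms actually selected.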
easydock/preparation_for_docking.py
Outdated
pdbqt_string, is_ok, error_msg = PDBQTWriterLegacy.write_string(setup)
if not is_ok:
    print(f"{mol.GetProp('_Name')} has error in converting to pdbqt: {error_msg}")
pdbqt_string_list.append(pdbqt_string)
What will pdbqt_string be if not is_ok? Is it reasonable to add this value to the list? This needs to be checked.
In the case of the implicit hydrogen error, mk_prepare_ligand returns an empty string ''. So, is it better not to append the empty string and let the program continue?
Yes, appending it is not reasonable. It can be handled as shown below. If one conformer fails, some other may still pass, so there is no need to interrupt the preparation. In the worst case we will return an empty list.
if not is_ok:
    print(f"{mol.GetProp('_Name')} has error in converting to pdbqt: {error_msg}")
else:
    pdbqt_string_list.append(pdbqt_string)
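A minimal sketch of this "skip failures, keep going" pattern, runnable without Meeko or RDKit. Both `collect_pdbqt` and `fake_convert` are hypothetical stand-ins; the only thing taken from the discussion is the three-value return `(pdbqt_string, is_ok, error_msg)` and the empty string on failure.

```python
def collect_pdbqt(conformers, convert):
    """Convert each (name, conformer) pair, keeping only successful results.

    convert is a hypothetical callable returning
    (pdbqt_string, is_ok, error_msg), mirroring the diff above.
    """
    pdbqt_string_list = []
    for name, conf in conformers:
        pdbqt_string, is_ok, error_msg = convert(conf)
        if not is_ok:
            print(f"{name} has error in converting to pdbqt: {error_msg}")
        else:
            pdbqt_string_list.append(pdbqt_string)
    return pdbqt_string_list


def fake_convert(conf):
    # Stand-in converter: returns an empty string and is_ok=False for "bad" input
    if conf == "bad":
        return "", False, "implicit hydrogens"
    return f"PDBQT({conf})", True, ""


result = collect_pdbqt(
    [("mol_0", "ok1"), ("mol_1", "bad"), ("mol_2", "ok2")], fake_convert
)
```

With this shape the caller has to treat an empty list as "no conformer could be prepared" rather than as an error raised during conversion.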
easydock/preparation_for_docking.py
Outdated
for ring in ssr:
    is_atom_saturated_array = np.array([atom_list[atom_id].GetHybridization() == Chem.HybridizationType.SP3 for atom_id in ring])
    is_ring_unsaturated = np.any(np.nonzero(is_atom_saturated_array==0))
    if is_ring_unsaturated:
        continue

    saturated_ring_list.append(ring)
for ring in ssr:
    is_atom_saturated_array = [mol.GetAtomWithIdx(atom_id).GetHybridization() == Chem.HybridizationType.SP3 for atom_id in ring]
    if any(is_atom_saturated_array):
        saturated_ring_list.append(ring)
This seems more compact and expressive.
easydock/preparation_for_docking.py
Outdated
def mol_embedding_3d(mol: Chem.Mol, seed: int=43) -> Chem.Mol:

    def find_saturated_ring(mol: Chem.Mol) -> list[list[int]]:
        atom_list = mol.GetAtoms()
The atom_list variable will no longer be required, so this line can be deleted.
easydock/preparation_for_docking.py
Outdated
if not isinstance(mol, Chem.Mol):
    return None

saturated_ring = find_saturated_ring(mol)
I suggest renaming the variable and the function by adding s at the end of both names, to indicate that the function returns multiple objects and the variable stores multiple objects.
For number one, say we have rings A and B, each with two ring conformations (A1, A2) and (B1, B2). If we have a molecule with four conformations (A1-B1, A1-B2, A2-B1, A2-B2), excluding similar conformations due to the linker, do we take all of the conformers because each has different ring conformations? Or do we take only two (either A1-B1 and A2-B2, or A1-B2 and A2-B1), because together they comprise all four ring conformations? I assume the issue is with averaging the matrix and not with generating the matrix based on the ring number, right?
Also for number four, I think I get your explanation, but can that accidentally remove the whole conformation? For example, out of 7 conformers [0...6], can |
For number 4. We will pass to |
Effectively at the step |
Number 1 is a difficult question.
It would probably be a good idea to ask the RDKit mailing list. As an alternative we may try Solution 1 is appropriate, although not perfect, so we may implement it to begin with.
Yeah, it seems that the second solution is quite troublesome. We can start with solution 1 first, then move to the second if needed. I have tried to interpret point number four. Can you check if I have understood it correctly? |
easydock/preparation_for_docking.py
Outdated

cmat_list_array = np.array(cmat_list)

return list(np.mean(cmat_list_array, axis=0))
I'm not strong enough in numpy to write the code off the top of my head, so I'll explain. Let cmat_list_array be a matrix A of size N x M, where N is the number of conformer pairs and M is the number of rings. You have to square every element of A, multiply each column by the number of atoms in the corresponding ring 1..M, sum along each row (that is, across rings), divide by the total number of ring atoms in the molecule and take the square root. This would be mathematically exact.
Simple averaging will work only if all rings have the same size.
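The recipe above could be sketched like this. It is a pure-Python illustration rather than the PR's numpy code: `combine_ring_rmsd` is a hypothetical name, rows are conformer pairs and columns are rings as in the comment, and the rings are assumed not to share atoms (otherwise the shared atoms would be counted twice in the denominator).

```python
import math


def combine_ring_rmsd(per_ring_rmsd, ring_sizes):
    """Combine per-ring RMSD values into one RMSD over all ring atoms.

    per_ring_rmsd: N rows (conformer pairs) x M columns (rings)
    ring_sizes: number of atoms in each of the M rings
    """
    total_atoms = sum(ring_sizes)
    combined = []
    for row in per_ring_rmsd:
        # n_k * rmsd_k**2 recovers the sum of squared distances in ring k
        sum_sq = sum(n * r * r for r, n in zip(row, ring_sizes))
        combined.append(math.sqrt(sum_sq / total_atoms))
    return combined


pairs = [[1.0, 2.0], [0.5, 0.5]]  # two conformer pairs, two rings
sizes = [5, 6]                    # a five- and a six-membered ring
combined = combine_ring_rmsd(pairs, sizes)
simple_mean = [sum(row) / len(row) for row in pairs]
```

For the first pair the weighted value is sqrt(29/11), which differs from the simple mean of 1.5; for the second pair both rings have the same RMSD, so the two approaches agree.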
I am still confused, but do you want to calculate the mathematical RMSD based on the total number of atoms in the rings? Say we have a five- and a six-membered ring. Instead of calculating:
We should calculate it this way to reflect the actual RMSD:
Yes, I propose calculating the true RMSD, which accounts for the atoms in all ring systems (the latter equation). However, we have RMSD values for individual rings only:
and
Therefore, I suggest recombining them into the final RMSD by simple math operations.
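Written out, the recombination follows from the definition of RMSD. The notation here is an assumption made for illustration and is not taken from the comments: $n_k$ is the number of atoms in ring $k$, $d_{k,i}$ is the distance between matched atom $i$ of ring $k$ in the two conformers, and the $M$ rings are assumed not to share atoms.

```latex
\mathrm{RMSD}_k = \sqrt{\frac{1}{n_k}\sum_{i=1}^{n_k} d_{k,i}^{2}}
\quad\Longrightarrow\quad
\sum_{i=1}^{n_k} d_{k,i}^{2} = n_k\,\mathrm{RMSD}_k^{2}

\mathrm{RMSD}_{\mathrm{total}}
= \sqrt{\frac{\sum_{k=1}^{M}\sum_{i=1}^{n_k} d_{k,i}^{2}}{\sum_{k=1}^{M} n_k}}
= \sqrt{\frac{\sum_{k=1}^{M} n_k\,\mathrm{RMSD}_k^{2}}{\sum_{k=1}^{M} n_k}}
```

For a five- and a six-membered ring this gives $\sqrt{(5\,\mathrm{RMSD}_5^{2} + 6\,\mathrm{RMSD}_6^{2})/11}$, which coincides with the arithmetic mean of the per-ring values only in special cases such as equal ring RMSDs.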
Alright. I have edited the code to calculate the true RMSD based on the RDKit RMSD.
easydock/preparation_for_docking.py
Outdated
saturated_ring_list = []
for ring in ssr:
    is_atom_saturated_array = [mol.GetAtomWithIdx(atom_id).GetHybridization() == Chem.HybridizationType.SP3 for atom_id in ring]
    if all(is_atom_saturated_array):
I think any is more appropriate than all, because all will skip any saturated ring fused with an aromatic ring. If the conformations of such a ring are very similar, they will be cut by the rms filter. So, it looks safe to use the less strict condition.
I see. Yeah, in that case we should make it less strict.
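The difference between the two conditions can be seen on a toy fused system. This sketch avoids RDKit: hybridization is given as plain string labels, and `find_saturated_rings` with its `require_all` flag is a hypothetical helper written only for this comparison.

```python
def find_saturated_rings(rings, hybridization, require_all=False):
    """Select rings by sp3 content.

    rings: list of rings, each a list of atom indices
    hybridization: dict mapping atom index -> label such as "SP3" or "SP2"
    require_all: True mimics all(...), False mimics any(...)
    """
    check = all if require_all else any
    return [ring for ring in rings
            if check(hybridization[i] == "SP3" for i in ring)]


# Tetralin-like toy example: atoms 0-5 are aromatic, atoms 6-9 are sp3.
# The saturated ring shares atoms 4 and 5 with the aromatic ring.
hyb = {i: "SP2" for i in range(6)}
hyb.update({i: "SP3" for i in range(6, 10)})
rings = [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]

strict = find_saturated_rings(rings, hyb, require_all=True)
relaxed = find_saturated_rings(rings, hyb, require_all=False)
```

The strict check returns no rings at all here, because the two fusion atoms are not sp3, while the relaxed check keeps the partially saturated ring and still rejects the purely aromatic one.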
easydock/preparation_for_docking.py
Outdated
if keep_nconf:
    if mol.GetNumConformers() <= keep_nconf:
        return mol
This can be simplified:
if keep_nconf and mol.GetNumConformers() > keep_nconf:
    ...
easydock/preparation_for_docking.py
Outdated
return mol

cids = [c.GetId() for c in mol.GetConformers()]
arr = arr[np.ix_(cids, cids)]
This is a potential issue. Conformer ids are arbitrary; unlike atom indices in RDKit, they are not guaranteed to be sequential from 0 to N-1. Thus, to be safe, another indexing should be used, probably involving keep_ids.
So, instead of cids, we can use keep_ids as the reference for what is removed by the nkeep filter? Something like this?
arr = arr[np.ix_(keep_ids, keep_ids)]
keep_ids_nconf_filter = []
#calculation
remove_ids = set(keep_ids) - set(keep_ids_nconf_filter)
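One way to make the indexing independent of the actual conformer id values is to translate ids into row positions first. The sketch below uses nested lists instead of a numpy array, and `submatrix_by_conf_id` is a hypothetical helper; it assumes the matrix rows and columns follow the order of `all_conf_ids`.

```python
def submatrix_by_conf_id(arr, all_conf_ids, keep_ids):
    """Extract the rows/columns of a square distance matrix for keep_ids.

    arr: square matrix (list of lists) ordered like all_conf_ids
    all_conf_ids: conformer ids in matrix order; may be non-sequential
    keep_ids: conformer ids to retain
    """
    # Map each conformer id to its row/column position in the matrix
    position = {cid: pos for pos, cid in enumerate(all_conf_ids)}
    rows = [position[cid] for cid in keep_ids]
    return [[arr[r][c] for c in rows] for r in rows]


conf_ids = [0, 3, 7]  # ids left after earlier removals, not 0..N-1
arr = [[0.0, 1.2, 0.4],
       [1.2, 0.0, 0.9],
       [0.4, 0.9, 0.0]]
sub = submatrix_by_conf_id(arr, conf_ids, keep_ids=[0, 7])
```

Indexing this 3 x 3 matrix directly with id 7 would be out of range; with a numpy array the same `rows` list could be passed to `np.ix_(rows, rows)`.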
test_conformer2.zip |
You were right with these corrections, I really got lost with the indexing.
|
Yes, you are completely right. In this case on the first iteration you select cid This iterative strategy can be implemented instead of the suggested two-step approach. However, the two-step procedure looks a little bit more reasonable and should be faster. On the first step we quickly identify a representative subset of conformers and after that we reduce redundancy to a given level. |
Okay, I got what you mean by the filter. I'll try to code the feature later. For the meantime, here is the rms result. I didn't try >2, because at |
…ormer with highest rms, that conformer is also removed
I looked through the conformers and my opinion is that We may combine To test how this combination will work we may use betulinic acid. It has 5 saturated ring systems, but the structure is very rigid and should result in a very small number of conformers. Do you see some issues with this filtration approach? |
I don't see much of a problem with the keep_nconf filtration approach. It probably won't be used much because the "below_rms_filter" will probably reduce the I'm not sure how I should count the ring size, since the conformer_belowrmsfilter_and_keepnconf.zip Here is the |
The results with and without
What you suggest is the right way. First, this will avoid code duplication. Second, it will bring a small speed-up. Please also remove the hard-coded saving to a file before the merge.
For chondroitin, the conformers with both axial or equatorial position conformer seen in the
|
Thank you. So maybe this filtration is not worthwhile? However, if we disable it, the output will contain conformers with RMSD below the threshold, which can be considered unexpected behavior. Let's try to disable this filtration step and repeat the tests. What will the conformers for other (single-ring) compounds look like? Will they be as diverse as with the current version? You may test one more implementation: we may replace the first clustering with this iterative procedure. Could you test this hypothesis as well?
conformer_without_clustering_and_iterative.zip |
Difficult to decide which one is better. Without the iterative step it is better for chondroitin but worse for oxepane. Without clustering it is vice versa. However, the output for oxepane looks strange. Clustering alone gives 1 conformer, the iterative approach alone gives 3. We use complete linkage clustering, which means that all conformers in a cluster are at most 1 Å apart. Since we did not remove conformers after clustering, there must have been only one cluster for oxepane, in which every conformer differed by less than 1 Å from every other. But when only the iterative procedure was applied, there were at least three conformers that differed by more than 1 Å. Could you please check this? Maybe I'm wrong in my reasoning.
For thiepane and oxepane in the iterative process, the three conformers exist because it skipped the iterative process, and immediately filtered through the
#sometimes clustering result gives matrix < rms when rms is high enough
if all(arr[arr != 0] < rms) or not any(arr[arr != 0] < rms):
    break
Here is the arr for both thiepane and oxepane:
[For Testing Only] oxepane_0 has 1 saturated ring
[For Testing Only] Before removing conformation: oxepane_0 has 100 conf
[[0. 0.61248338 0.58374022]
[0.61248338 0. 0.51815107]
[0.58374022 0.51815107 0. ]]
[For Testing Only] After removing conformation: oxepane_0 has 3 conf
conformer_without_clustering/oxepane_0_after_remove_100.sdf
[For Testing Only] thiepane_unsaturated_0 has 1 saturated ring
[For Testing Only] Before removing conformation: thiepane_unsaturated_0 has 100 conf
[[0. 0.6227772 0.25156321]
[0.6227772 0. 0.39474366]
[0.25156321 0.39474366 0. ]]
[For Testing Only] After removing conformation: thiepane_unsaturated_0 has 3 conf |
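To check why the loop exits immediately for these two matrices, the break condition can be evaluated on the printed values. The function below is a pure-Python equivalent of the numpy expression quoted in the comment; `loop_would_break` is a hypothetical name, and `rms = 1.0` is an assumption based on the 1 Å figure mentioned earlier in the discussion.

```python
def loop_would_break(arr, rms):
    """Pure-Python equivalent of the quoted numpy condition:
    all(arr[arr != 0] < rms) or not any(arr[arr != 0] < rms)
    """
    nonzero = [v for row in arr for v in row if v != 0]
    below = [v < rms for v in nonzero]
    return all(below) or not any(below)


oxepane = [[0.0, 0.61248338, 0.58374022],
           [0.61248338, 0.0, 0.51815107],
           [0.58374022, 0.51815107, 0.0]]
thiepane = [[0.0, 0.6227772, 0.25156321],
            [0.6227772, 0.0, 0.39474366],
            [0.25156321, 0.39474366, 0.0]]

rms = 1.0  # assumed threshold, not confirmed by the log output
oxepane_breaks = loop_would_break(oxepane, rms)
thiepane_breaks = loop_would_break(thiepane, rms)
# A matrix with values on both sides of the threshold does not break
mixed_breaks = loop_would_break(
    [[0.0, 0.5, 1.5], [0.5, 0.0, 1.2], [1.5, 1.2, 0.0]], rms
)
```

Under that assumed threshold every off-diagonal value is below `rms`, so the first half of the condition is true and no conformer is removed, which is consistent with three conformers remaining even though they are closer than the threshold.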
Thank you! |
Yepp, that should be fine. I think the three generated conformers are also very similar because of the symmetries, so the docking poses should be very similar to each other.
Agree, then please finalize the code to merge it and test. |
should I remove the sanity check print also before merging it? |
Not necessary, I expect that I'll go through the code and will remove it later. Thank you! |
my bad I missed the review for the |
We made preliminary tests and the results were not very encouraging. Docking of some strange ring conformations can be much more favorable than that of the native conformation. So, to answer the question we have to perform a more systematic study.
All of this will require time, and I'm not sure that we have this time right now.
Alright. Then, let me know if there is anything that I can do or when we can continue implementing the feature. I have tried testing betulinic acid with two co-crystallised protein structures, and one of them (5LSG) does not look promising to me. The other one (8GXP) shows the same conformation as the crystallised structure, given the correct isomer.
Implement Saturated Ring Sampling Conformation as discussed in Issue #33.
test_ring_conformer.zip
Note that:
remove_confs_rms(mol) function outside of the mol_embedding_3d(mol) function, so I just put it inside for now.
numConfs value as default in the EmbedMultipleConfs, but I have to specify the number in the function. So, I put 10 as its default value. I am not sure what value we should put for this parameter.
rms=0.25. I am not sure how to do the subset of conformers with keep_nconf after rms criterion. Is it the below code?
remove_confs_rms(mol) is executed after UFFOptimizeMolecule(mol) if I interpret it correctly.
remove_confs_rms(mol) to return mol instead of mol_tmp because the mk_prepare_ligand(mol) complained about the implicit hydrogen, which I assume is because of the mol_tmp = Chem.RemoveHs(mol)
run_dock -i test_saturated_ring.smi -o test_ring_conformer.db --config config.yml -s 2 --protonation pkasolver --sdf --program vina. Since the sugar gives too many isomers, I just limited it to two