feat: use rdkit internal PDBResidueInfo for standard annotations #742

Croydon-Brixton
Contributor

@Croydon-Brixton Croydon-Brixton commented Jan 27, 2025

Here's an attempt at grouping the standard annotations under the appropriate RDKit annotation, as discussed in #736 (comment).

Let me know what you think. There are two things I'm not 100% happy with yet:
(1) Iterating over atoms in Python seems inefficient; I wonder if there's an array-like API in RDKit.
(2) I'm currently not tracking whether an annotation was explicitly set because it was present in the original AtomArray, or whether it is just the default value being filled in. Tracking this seems non-trivial for RDKit methods that change the underlying mol, unless maybe we add a bool property that records whether an annotation exists?

@Croydon-Brixton
Contributor Author

Croydon-Brixton commented Jan 27, 2025

@greglandrum, do you have a hint on the above by any chance?


codspeed-hq bot commented Jan 27, 2025

CodSpeed Performance Report

Merging #742 will not alter performance

Comparing Croydon-Brixton:feat/rdkit_native_annotations (13a7931) with main (4d02bb8)

Summary

✅ 59 untouched benchmarks

Member

@padix-key padix-key left a comment

Thanks for taking care of this refactoring. I left a few questions in the comments, but overall this looks very good to me. Regarding your two questions:

  1. I agree this is probably not ideal, but I suppose this interface will be mainly used to convert molecules with only a few atoms. Hence, the performance penalty of iterating over atoms should not be that big.
  2. Just to clarify: You mean this would be a problem for optional annotations, e.g. I would always get a value for GetOccupancy() independent of whether it was set by as_mol() or not?

@greglandrum

> @greglandrum, do you have a hint on the above by any chance?

@Croydon-Brixton I would be happy to help, but I don't have time to wade through the issue discussion and your code to try and figure out what you want to do. If you have a specific RDKit question, I would be happy to try and answer it.

@Croydon-Brixton
Contributor Author

> Thanks for taking care of this refactoring. I left a few questions in the comments, but overall this looks very good to me. Regarding your two questions:
>
>   1. I agree this is probably not ideal, but I suppose this interface will be mainly used to convert molecules with only a few atoms. Hence, the performance penalty of iterating over atoms should not be that big.
>   2. Just to clarify: You mean this would be a problem for optional annotations, e.g. I would always get a value for GetOccupancy() independent of whether it was set by as_mol() or not?

Re 1) I agree, it's not terribly slow and there doesn't seem to be an easy way around it.
Re 2) Yes exactly, that's what I meant. The previous approach only looked at the manually added extra annotations and added those where they were present, but the PDBInfo provides default values even if none were set. We could engineer around it by setting 'protected' boolean property flags that record whether a value was set, but that feels somewhat inelegant. Do you have any thoughts? Or do you think it's not needed?
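
For readers following along, the "boolean property flag" idea mentioned above could look roughly like the sketch below. It is only an illustration, not what the PR implements: the property name `_occupancy_was_set` and the helpers `make_atom` and `read_occupancy` are hypothetical, and only occupancy is shown.

```python
from rdkit.Chem import Atom, AtomPDBResidueInfo

# Hypothetical property name; it is not defined by RDKit or Biotite.
OCCUPANCY_SET_FLAG = "_occupancy_was_set"


def make_atom(element, atom_name, occupancy=None):
    """Build an RDKit atom and record whether occupancy was given explicitly."""
    atom = Atom(element)
    info = AtomPDBResidueInfo(atomName=atom_name)
    if occupancy is not None:
        info.SetOccupancy(float(occupancy))
        # Mark the value as explicitly set so it can be told apart from a default.
        atom.SetBoolProp(OCCUPANCY_SET_FLAG, True)
    atom.SetPDBResidueInfo(info)
    return atom


def read_occupancy(atom):
    """Return the occupancy only if it was explicitly set, otherwise None."""
    if atom.HasProp(OCCUPANCY_SET_FLAG) and atom.GetBoolProp(OCCUPANCY_SET_FLAG):
        return atom.GetPDBResidueInfo().GetOccupancy()
    return None


explicit = make_atom("C", "CA", occupancy=0.5)
implicit = make_atom("C", "CB")

print(read_occupancy(explicit))
print(read_occupancy(implicit))
# The residue info itself still reports a default value for the second atom.
print(implicit.GetPDBResidueInfo().GetOccupancy())
```

The cost of this approach is one extra property per optional annotation, which is the inelegance raised in the comment above.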

@Croydon-Brixton
Contributor Author

Croydon-Brixton commented Jan 30, 2025

> > @greglandrum, do you have a hint on the above by any chance?
>
> @Croydon-Brixton I would be happy to help, but I don't have time to wade through the issue discussion and your code to try and figure out what you want to do. If you have a specific RDKit question, I would be happy to try and answer it.

Thank you! Yes, the specific question is the following:

Currently we iterate over NumPy arrays (atoms.element, atoms.charge, ...) in a Python for loop to set the attributes of the RDKit atoms. Is there an array-like interface that would let us set a whole array of values directly and avoid the Python loop?

Here's a minimal example snippet:

...
    for i in range(atoms.array_length()):
        rdkit_atom = Atom(atoms.element[i].capitalize())
        rdkit_atom.SetFormalCharge(atoms.charge[i].item())
        rdkit_atom_res_info = AtomPDBResidueInfo(
            atomName=atoms.atom_name[i].item(),
        )
        rdkit_atom.SetPDBResidueInfo(rdkit_atom_res_info)
        mol.AddAtom(rdkit_atom)
 ...
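
A self-contained variant of the snippet above may help when trying it outside the PR. This is a sketch under assumptions: plain NumPy arrays (`elements`, `charges`, `atom_names`, all hypothetical) stand in for the `atoms.element`, `atoms.charge` and `atoms.atom_name` annotations of an AtomArray, and no bonds or coordinates are handled.

```python
import numpy as np
from rdkit.Chem import Atom, AtomPDBResidueInfo, RWMol

# Hypothetical stand-ins for the AtomArray annotation arrays.
elements = np.array(["C", "O", "N"])
charges = np.array([0, -1, 1])
atom_names = np.array(["CA", "OXT", "N"])

mol = RWMol()
for i in range(len(elements)):
    # RDKit expects a plain Python str as element symbol, e.g. "C".
    rdkit_atom = Atom(str(elements[i]).capitalize())
    # .item() converts NumPy scalars into native Python values.
    rdkit_atom.SetFormalCharge(charges[i].item())
    res_info = AtomPDBResidueInfo(atomName=atom_names[i].item())
    rdkit_atom.SetPDBResidueInfo(res_info)
    # AddAtom() stores a copy of the atom in the molecule.
    mol.AddAtom(rdkit_atom)

# Read the values back from the molecule to check the round trip.
names = [atom.GetPDBResidueInfo().GetName() for atom in mol.GetAtoms()]
charges_out = [atom.GetFormalCharge() for atom in mol.GetAtoms()]
print(names, charges_out)
```

The loop still runs once per atom in Python; the sketch does not answer whether RDKit offers a bulk setter.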

@padix-key
Member

padix-key commented Jan 30, 2025

> We could engineer around it by setting 'protected' boolean property flags of whether a value was set, but that feels somewhat inelegant. Do you have any thoughts? Or do you think it's not needed?

I think it would be OK then to always add these annotations to the AtomArray returned by from_mol(). I see no big drawback in having them there, apart from a performance penalty, which should be small for small molecules.
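
The "always add these annotations" option can be sketched as follows. This is an illustration only, not the code merged in this PR: `read_pdb_annotations` is a hypothetical helper that fills plain NumPy arrays rather than a Biotite AtomArray, and it covers just the atom name, occupancy and temperature factor.

```python
import numpy as np
from rdkit.Chem import Atom, AtomPDBResidueInfo, RWMol


def read_pdb_annotations(mol):
    """Collect per-atom PDB annotations, using RDKit's defaults where unset."""
    n_atoms = mol.GetNumAtoms()
    atom_name = np.zeros(n_atoms, dtype="U6")
    occupancy = np.zeros(n_atoms, dtype=float)
    b_factor = np.zeros(n_atoms, dtype=float)
    for i, atom in enumerate(mol.GetAtoms()):
        info = atom.GetPDBResidueInfo()
        if info is None:
            # No residue info attached: fall back to a default-filled object.
            info = AtomPDBResidueInfo(atomName="")
        atom_name[i] = info.GetName()
        occupancy[i] = info.GetOccupancy()
        b_factor[i] = info.GetTempFactor()
    return atom_name, occupancy, b_factor


# Small demonstration: one atom with residue info, one without.
mol = RWMol()
annotated = Atom("C")
info = AtomPDBResidueInfo(atomName="CA")
info.SetOccupancy(0.5)
annotated.SetPDBResidueInfo(info)
mol.AddAtom(annotated)
mol.AddAtom(Atom("O"))

atom_name, occupancy, b_factor = read_pdb_annotations(mol)
print(atom_name, occupancy, b_factor)
```

With this approach the returned arrays always exist, so a caller cannot tell a default occupancy from one that was set explicitly, which is the trade-off accepted in the comment above.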

Member

@padix-key padix-key left a comment

One final comment, but otherwise this is good to go from my side.

@Croydon-Brixton
Contributor Author

Croydon-Brixton commented Jan 31, 2025

Sounds good -- I removed the atom naming part as suggested. Good to go from my side too now!

@padix-key
Member

Ah, the doctest needs to be updated as well, hence the current CI failure. The current doctest still uses GetProp() to print the name of each atom.

@padix-key
Member

From my perspective, we can address potential performance improvements later. Feel free to merge if you are ready.

@Croydon-Brixton
Contributor Author

Sounds good to me, merging now!

@Croydon-Brixton Croydon-Brixton merged commit 4c7204a into biotite-dev:interfaces Feb 2, 2025
28 checks passed
@Croydon-Brixton Croydon-Brixton deleted the feat/rdkit_native_annotations branch February 2, 2025 16:54