Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow loading computationally predicted cif files, which may not have certain custom annotations #670

Merged

Conversation

Croydon-Brixton
Copy link
Contributor

When loading cif files, it is convenient to specify extra fields to be loaded (b_factor, occupancy, charge) without having to first inspect the atom_site record to see whether they exist.

This PR implements this functionality by providing fallback default values for b_factor (nan) and occupancy (1.0) in cases where no annotation is given but the extra field is nontheless required.

It also warns when formal charge is inferred, which currently happens silently assuming a fully uncharged structure charge=0 everywhere.

Copy link

codspeed-hq bot commented Oct 4, 2024

CodSpeed Performance Report

Merging #670 will not alter performance

Comparing Croydon-Brixton:feat/cif_allow_missing_extra_fields (9b235e1) with main (4ca6318)

Summary

✅ 45 untouched benchmarks

@padix-key
Copy link
Member

Thanks for the effort. However, I am somewhat divided regarding this feature.

Pro:

  • Convenience, especially when scraping a lot of files
  • Sensible defaults

Contra:

  • Warnings can be easily overlooked, so a user might not see that the values are fabricated
  • Inconsistency with the non-special extra fields, as this does not work for them

@biotite-dev Do you have opinions about this?

@nscorley
Copy link
Contributor

nscorley commented Oct 8, 2024

Thanks for the effort. However, I am somewhat divided regarding this feature.

Pro:

  • Convenience, especially when scraping a lot of files
  • Sensible defaults

Contra:

  • Warnings can be easily overlooked, so a user might not see that the values are fabricated
  • Inconsistency with the non-special extra fields, as this does not work for them

@biotite-dev Do you have opinions about this?

Providing some additional context why this feature may be helpful in practice:

  • During protein engineering workflows, protein designers often obtain predicted structure files in the form of PDB's or bare-bones CIF files from AlphaFold-2, RoseTTAFold All-Atom, or other deep-learning tools
  • These computationally-predicted files "only" contain fully-resolved atoms and residues, and thus may not include an "occupancy" or "b-factor" column (in fact, AF-2 fills the pLDDT metric into the b-factor column)
  • Ability to load these files despite not having some of RCSB's expected fields would allow Biotite to better generalize to repositories of structural data outside the Protein Data Bank

@JHKru
Copy link
Member

JHKru commented Oct 9, 2024

Hi, I would also lean towards sensible defaults here.
Maybe an additional optional argument could be added for strict behaviour/errors instead of warnings, if a non-existant special field is requested with extra_fields (something like strict_special_categories=False as standard case)?
Might bloat the arguments a bit, but maybe there are use cases, where the current behaviour is beneficial.

@Croydon-Brixton
Copy link
Contributor Author

Hi, I would also lean towards sensible defaults here. Maybe an additional optional argument could be added for strict behaviour/errors instead of warnings, if a non-existant special field is requested with extra_fields (something like strict_special_categories=False as standard case)? Might bloat the arguments a bit, but maybe there are use cases, where the current behaviour is beneficial.

That sounds like a good option to me - if you agree @padix-key , I'm happy to update the MR accordingly

@padix-key
Copy link
Member

padix-key commented Oct 12, 2024

Maybe an additional optional argument could be added for strict behaviour/errors instead of warnings

I am inclined towards keeping the parameter list concise, especially as get_structure() already has quite a bunch. In the end the strict behavior can be enforced the same way, the user would have to go to check for available extra fields right now (without this PR): by checking for column names in pdbx_file.block["atom_site"].keys().

As I have no strong opinion for or against this new behavior here and as the small user base sample in this PR 😉 clearly votes for sensible defaults, I think this is a good way to go 👍.

As mentioned in #668, we have to wait for #661 to fix the CI.

@padix-key
Copy link
Member

Can you run ruff over the code? The code style CI check fails currently.

@Croydon-Brixton Croydon-Brixton force-pushed the feat/cif_allow_missing_extra_fields branch from 2a10ccb to 9b235e1 Compare October 16, 2024 17:04
@padix-key padix-key merged commit a384459 into biotite-dev:main Oct 19, 2024
27 checks passed
@padix-key
Copy link
Member

Thanks for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants