-
Notifications
You must be signed in to change notification settings - Fork 733
Implementation of fetch_pdb() #4943
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
I'm not sure where to put this code in the codebase, so I create a new folder for it right now. I'm open to it being moved somewhere Some stuff which I like to still add (besides tests and docs):
|
I think others will have to confirm, but likely we'll want to have Additional it's not finalised yet but if the mmcif reader in #2367 gets finalised then the default download shouldn't be |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #4943 +/- ##
===========================================
+ Coverage 92.68% 93.02% +0.34%
===========================================
Files 180 180
Lines 22452 22473 +21
Branches 3186 3190 +4
===========================================
+ Hits 20809 20905 +96
+ Misses 1169 1113 -56
+ Partials 474 455 -19 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
I'm ok with that. I can make the code raise an exception if |
Assuming that |
You've added it to one of the optional dependency categories which is all that should be required. For the actual files where it is used you'll need to have something setup like the usage of biopython: mdanalysis/package/MDAnalysis/analysis/align.py Lines 200 to 207 in dcaa087
I'm not an expert on the pipelines so someone else would have to pitch in more on that. |
Thanks for the comment! |
I happen to have another question! Is it normal for some of the tests to not be consistent across each commit? From what I understand, each github CLI has to get and build each MDAnalysis from source, and this instance can potentially timeout from what I observe across each commit. The macOS (of the latest commit) failed at 97% of test because it reached the max wall time of two hours. Even then the latest Azure tests failed because of other tests in the source code which I didn't write (namely due to other tests)
|
In principle, tests should pass everywhere. The Azure tests time out in the test
which looks like something that you added. I haven't looked at your code but it might simply be the case that some stuff needs to be written differently for windows. |
@jauy123 do you have time to pick up this PR again? Would be great to have the feature in 2.10! |
I have time again. I was busy starting the end of spring break with comps, classes, and you know what. |
…since text files are just binary files with special encoding
@IAlibay @BradyAJohnston @orbeckst I added the test, and github's machines are now returning 100% coverage. I also refilled out the changelog. I believe this PR should be okay to merge with the dev branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @jauy123! I've made some non-blocking suggestions. The only blocking one is to explicitly document all supported formats for file_format
in the docstring
Still need to add to docstring. sphinx, and probably also still write test as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me, thanks @jauy123!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks all good in principle, but I have a few major requests:
- Have the output shape be the same as in the input shape, i.e., str -> str and list -> list and NOT "list of length 1" -> str, "list of length > 1" -> list.
- Update docs; you can accept my suggestions, for others you have to write a bit more. Look at it from a user's perspective: what do you want to/need to know?
- You need to follow our code style and you can use
darker
to format only your changes. Do not runblack
on all of the topology/PDBParser.py file because at the moment @RMeli has this file excluded from format checking (he wants to avoid collisions with active PRs). Do runblack test_fetch_pdb.py
because this is a completely new file.
See comments for other minor things.
Looking good, you're almost there!
Download multiple PDB files and converting them to a universe: | ||
>>> [mda.Universe(mda.fetch_pdb(PDB_ID), file_format="pdb.gz") for PDB_ID in ("1AKE", "4BWZ")] | ||
[<Universe with 3816 atoms>, <Universe with 2824 atoms>] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is downloading one-by-one in a loop preferred over downloading all at once (shown above), i.e.,
universes = [mda.Universe(pdb) for pdb in mda.fetch_pdb(["1AKE", "4BWZ"], progressbar=True)]
If you want an example with multiple files then I'd use the one that uses fetch_pdb
's capability to download multiple files at once.
---------- | ||
PDB_IDS : str or sequence of str | ||
A single PDB ID as a string, or a sequence of PDB IDs to fetch. | ||
cache_path : str or pathlib.Path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain what the default None
does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the code I see that it enables a default cache. Need to explain that and how to manage the default cache. Possibly link to relevant pooch docs.
User would want to know where data are saved, how they can clean up the data, etc.
Returns | ||
------- | ||
str or list of str | ||
The path(s) to the downloaded file(s). Returns a single string if one PDB ID is given, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when I ask for a list with a single ID?
fetch_pdb(["1ake"])
I hope it returns a list of length 1 as this will make it so much easier to write code that works seamlessly with any number of inputs.
Notes | ||
----- | ||
This function downloads using the API established here at https://www.rcsb.org/docs/programmatic-access/file-download-services. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not use "here" for links — that's really bad for accessibility. Instead integrate into language:
This function downloads using the API established here at https://www.rcsb.org/docs/programmatic-access/file-download-services. | |
This function uses the `RCSB File Download Services`_ for directly downloading | |
structure files via https. | |
.. _`RCSB File Download Services`: | |
https://www.rcsb.org/docs/programmatic-access/file-download-services |
I wouldn't call it an API (the RCSB Web API is a real programmatic API).
Also, keep the text within 79 characters for easier readability.
----- | ||
This function downloads using the API established here at https://www.rcsb.org/docs/programmatic-access/file-download-services. | ||
Currently supported file formats (per the API) are ('cif', 'cif.gz', 'bcif', 'bcif.gz', 'xml', 'xml.gz', 'pdb', 'pdb.gz', 'pdb1', 'pdb1.gz') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be clear that not all of these formats can be read:
Currently supported file formats (per the API) are ('cif', 'cif.gz', 'bcif', 'bcif.gz', 'xml', 'xml.gz', 'pdb', 'pdb.gz', 'pdb1', 'pdb1.gz') | |
The RCSB currently provides data in 'cif', 'cif.gz', 'bcif', 'bcif.gz', 'xml', | |
'xml.gz', 'pdb', 'pdb.gz', 'pdb1', 'pdb1.gz' file formats and can therefore be | |
downloaded. Not all of these formats can be currently read with MDAnalysis. |
if isinstance(PDB_IDS, str): | ||
PDB_IDS = (PDB_IDS,) | ||
|
||
if cache_path is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
document what happens for None
- Where is the cache located? Perhaps link to pooch docs
- How do I clear the cache?
- How do I force re-fetching?
cache_path : str or pathlib.Path | ||
Directory where downloaded file(s) will be cached. | ||
file_format : str | ||
The file extension/format to download (e.g., "cif", "pdb") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to document allowed values.
The file extension/format to download (e.g., "cif", "pdb") | |
The file extension/format to download (e.g., "cif", "pdb"). | |
See the Notes section below for a list of all supported file formats. |
if len(PDB_IDS) == 1: | ||
return str( | ||
downloader.fetch( | ||
fname=tuple(registry_dictionary.keys())[0], | ||
progressbar=progressbar, | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike this approach where a length 1 list always returns a string. It makes it very frustrating to write code that can process arbitrary lists because I have to check if I get back a single file, then it's a string, but if I gave multiple ones, it's a list. It's MUCH SANER if the ''output shape = input shape''.
- If input is a str, make output a str.
- If input is a sequence, make output a sequence.
package/CHANGELOG
Outdated
* Added new function `PDBParser.fetch_pdb` to download structure files from wwPDB | ||
using `pooch` as optional dependency. This can be used to interactively create | ||
`Universe` classes on the fly. (Issue #4907, PR #4943) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indent properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and shorten
* Added new function `PDBParser.fetch_pdb` to download structure files from wwPDB | |
using `pooch` as optional dependency. This can be used to interactively create | |
`Universe` classes on the fly. (Issue #4907, PR #4943) | |
* Added function `topology.PDBParser.fetch_pdb` (accessible as | |
`MDAnalysis.fetch_pdb()`) to download structure files from wwPDB using | |
`pooch` as optional dependency (Issue #4907, PR #4943) |
If you use a length 1 iterator for pdb_id, you get back a rank 1 list
Docstring changed Shortened really long comment in fetch_pdb
if type(pdb_ids) is str: | ||
return str( | ||
downloader.fetch( | ||
fname=tuple(registry_dictionary.keys())[0], | ||
progressbar=progressbar, | ||
) | ||
) | ||
else: | ||
return [ | ||
str( | ||
downloader.fetch(fname=file_name, progressbar=progressbar) | ||
for file_name in registry_dictionary.keys() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to simplify the code so that you only downloader.fetch(fname=...)
once? Along the lines of
paths = [downloader.fetch(fname=file_name, progressbar=progressbar) for file_name in registry_dictionary.keys()]
if type(pdb_ids) is str:
return paths[0]
return paths
or even more concise with the Python ternary operator:
paths = [downloader.fetch(fname=file_name, progressbar=progressbar) for file_name in registry_dictionary.keys()]
return paths if type(pdb_ids) is not str else paths[0]
class TestDocstringExamples: | ||
"""This class tests all the examples found in fetch_pdb's docstring""" | ||
|
||
@pytest.mark.parametrize("pdb_id", [("1AKE"), ("4BWZ")]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also test a list with a single entry ["1AKE"]
. Check that it returns a list of paths.
list_of_path_strings = mda.fetch_pdb( | ||
["1AKE", "4BWZ"], cache_path=tmp_path, progressbar=True | ||
) | ||
assert all(isinstance(PDB_ID, str) for PDB_ID in list_of_path_strings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also test the actual names not just types
reason="Pooch is installed.", | ||
) | ||
def test_pooch_installation(tmp_path): | ||
with pytest.raises(ModuleNotFoundError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also check the error message of the exception, there's an additional argument like
with pytest.raises(ModuleNotFoundError, match="pooch is needed as a dependency for fetch_pdb()"):
) | ||
@pytest.mark.parametrize("pdb_id", [("1AKE"), ("4BWZ")]) | ||
def test_no_cache_path(pdb_id): | ||
assert isinstance(mda.fetch_pdb(pdb_id, cache_path=None), str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good practice to clean up the default cache afterwards because if I run this test myself, the cache persists somewhere in my system. Tests should not leave anything behind.
paths = [downloader.fetch(fname=file_name, progressbar=progressbar) | ||
for file_name in registry_dictionary.keys()] | ||
|
||
return paths if type(pdb_ids) is not str else paths[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is not isinstance(pdb_ids, str)
as you used above is probably safer than using type
.
|
||
if HAS_POOCH: | ||
from requests.exceptions import HTTPError | ||
from pooch import os_cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd avoid importing into the namespace. When reading code below it's not clear where os_cache is from. It's much clearer to just import pooch
and then later use pooch.os_cache
.
class TestDocstringExamples: | ||
"""This class tests all the examples found in fetch_pdb's docstring""" | ||
|
||
@pytest.mark.parametrize("pdb_id", [("1AKE"), ("4BWZ")]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parentheses don't do anything so remove:
@pytest.mark.parametrize("pdb_id", ["1AKE", "4BWZ"])
mda.fetch_pdb( | ||
pdb_ids="1AKE", cache_path=tmp_path, file_format="barfoo" | ||
) | ||
@pytest.fixture() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice to create it as a fixture
@pytest.fixture() | ||
def clean_up_default_cache(): | ||
yield | ||
rmtree(os_cache("MDAnalysis_pdbs")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this to work one must know that "MDAnalysis_pdbs" is hard-coded into fetch_pdb
. It would be safer to define the name in PDBParser.py
ad
DEFAULT_POOCH_CACHE_NAME = "MDAnalysis_pdbs"
and then use DEFAULT_POOCH_CACHE_NAME
everywhere else. In particular, here you can then use
import MDAnalysis.topology.PDBParser
...
def clean_up_default_cache():
yield
rmtree(os_cache(MDAnalysis.topology.PDBParser.DEFAULT_POOCH_CACHE_NAME))
More typing initially but cleaner and more robust because the hard-coded name exists in exactly one location only.
from MDAnalysis.topology.PDBParser import HAS_POOCH | ||
|
||
from urllib import request | ||
from shutil import rmtree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I'd import the packages instead of functions.
(However, these are all standard library packages so people may know where rmtree etc come from so I'll leave it to you.)
class TestExpectedErrors: | ||
|
||
def test_invalid_pdb(self, tmp_path): | ||
with pytest.raises(HTTPError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
match="..." error output (when you test an exception you want to make sure it is "your" exception)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this particular case, this wouldn't be my exception. It's requests.exceptions.HTTPError
- a third party pooch dependency. I don't think I would need to match in that case?
mda.fetch_pdb(pdb_ids="foobar", cache_path=tmp_path) | ||
|
||
def test_invalid_file_format(self, tmp_path): | ||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
match="..." error output (when you test an exception you want to make sure it is "your" exception)
Fixes #4907
Changes made in this Pull Request:
This is a still work in progress, but here's a implementation of @BradyAJohnston 's code wrapped into classes. I still need to write tests and docs for the entire thing.
DownloaderBase
and 'PDBDownloader' in order to implement downloading structure file from online sources such as the PDB databank.requests
as a dependencymda.fetch_pdb()
is implemented as a wrapper to commonly used option in 'PDBDownloader'PR Checklist
package/CHANGELOG
file updated?Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--4943.org.readthedocs.build/en/4943/