
Conversation

@jauy123 (Contributor) commented Mar 3, 2025

Fixes #4907

Changes made in this Pull Request:

This is still a work in progress, but here's an implementation of @BradyAJohnston's code wrapped into classes. I still need to write tests and docs for the entire thing.

  • Added two classes, `DownloaderBase` and `PDBDownloader`, to implement downloading structure files from online sources such as the PDB databank.
  • Added `requests` as a dependency.
  • `mda.fetch_pdb()` is implemented as a wrapper around the most commonly used options of `PDBDownloader`.
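A rough sketch of how these pieces might fit together (only the names `DownloaderBase`, `PDBDownloader`, and `fetch_pdb` come from the PR description; the method name, the stubbed download logic, and the return values here are illustrative, and the real implementation performs an HTTP download via requests):

```python
from abc import ABC, abstractmethod


class DownloaderBase(ABC):
    """Base class for fetching structure files from online sources."""

    @abstractmethod
    def download(self, entry_id: str) -> str:
        """Download one entry and return the local file path."""


class PDBDownloader(DownloaderBase):
    # RCSB file download endpoint (per the RCSB file download services docs)
    BASE_URL = "https://files.rcsb.org/download"

    def download(self, entry_id: str) -> str:
        # The real code would fetch f"{self.BASE_URL}/{entry_id}.pdb" with
        # requests and write it into a cache directory; stubbed here.
        return f"{entry_id.upper()}.pdb"


def fetch_pdb(pdb_id: str) -> str:
    # Thin wrapper over the most common PDBDownloader usage
    return PDBDownloader().download(pdb_id)
```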

PR Checklist

  • Issue raised/referenced?
  • Tests updated/added?
  • Documentation updated/added?
  • package/CHANGELOG file updated?

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--4943.org.readthedocs.build/en/4943/

@jauy123 (Contributor, Author) commented Mar 3, 2025

I'm not sure where this code belongs in the codebase, so I created a new folder for it for now. I'm open to it being moved somewhere else.

Some things I'd still like to add (besides tests and docs):

  1. A verbose option for `PdbDownloader.download()` (I think `tqdm` was a dependency last time I checked?)
  2. Integration with MDAnalysis' logger
  3. The cache logic could probably be wrapped into a separate function

@BradyAJohnston BradyAJohnston self-assigned this Mar 4, 2025
@BradyAJohnston (Member) commented Mar 4, 2025

I think others will have to confirm, but we'll likely want requests to be an optional dependency to reduce the core library's dependencies (fetching structures won't be something a lot of users will be doing).

Additionally, it's not finalised yet, but if the mmCIF reader in #2367 gets finalised then the default download format shouldn't be .pdb (it can remain so for now).

@codecov codecov bot commented Mar 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.02%. Comparing base (5c7c480) to head (31a8f7b).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4943      +/-   ##
===========================================
+ Coverage    92.68%   93.02%   +0.34%     
===========================================
  Files          180      180              
  Lines        22452    22473      +21     
  Branches      3186     3190       +4     
===========================================
+ Hits         20809    20905      +96     
+ Misses        1169     1113      -56     
+ Partials       474      455      -19     

☔ View full report in Codecov by Sentry.

@jauy123 (Contributor, Author) commented Mar 5, 2025

I'm ok with that. I can make the code raise an exception if requests is not in the environment.

@jauy123 (Contributor, Author) commented Mar 10, 2025

Assuming that requests will be an optional dependency, how exactly would I specify that in the build files? Right now I'm just hard-coding it in, so that the GitHub CI tests can build and run successfully.
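For what it's worth, with a pyproject.toml-based build the usual pattern is an extras group. This is only a sketch; the group name `extra_formats` is a guess, and the actual group used in MDAnalysis' build files may differ:

```toml
[project.optional-dependencies]
# hypothetical extras group name; check the project's existing groups
extra_formats = [
    "requests",
]
```

Users would then install it with something like `pip install MDAnalysis[extra_formats]`.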

@BradyAJohnston (Member):

You've added it to one of the optional dependency categories, which is all that should be required. In the files where it's used you'll need a guard like the one for biopython:

try:
    import Bio.AlignIO
    import Bio.Align
    import Bio.Align.Applications
except ImportError:
    HAS_BIOPYTHON = False
else:
    HAS_BIOPYTHON = True

I'm not an expert on the pipelines so someone else would have to pitch in more on that.
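A minimal sketch of the same guard applied to pooch, assuming the module-level flag is named `HAS_POOCH` (as in the PR) and using the error message tested later in this thread; the `fetch_pdb` body here is a placeholder:

```python
# Optional-dependency guard, mirroring the biopython pattern above
try:
    import pooch
except ImportError:
    HAS_POOCH = False
else:
    HAS_POOCH = True


def fetch_pdb(pdb_ids, **kwargs):
    # Fail early with a clear message if the optional dependency is missing
    if not HAS_POOCH:
        raise ModuleNotFoundError(
            "pooch is needed as a dependency for fetch_pdb()"
        )
    ...  # download logic goes here
```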

@jauy123 (Contributor, Author) commented Mar 19, 2025

Thanks for the comment!

@jauy123 (Contributor, Author) commented Mar 19, 2025

I happen to have another question!

Is it normal for some of the tests to be inconsistent across commits? From what I understand, each GitHub CI run has to fetch and build MDAnalysis from source, and a run can potentially time out, which is what I observe across commits.

The macOS run (on the latest commit) failed at 97% of the tests because it reached the maximum wall time of two hours.

Even then, the latest Azure run failed because of tests elsewhere in the source code which I didn't write:

From Azure_Tests Win-Python310-64bit-full (commit 651bf267076d2d7da6491608b1b5136915caf2e2)

FAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDReaderWithRealTrajectory::test_open_filestream - Issue #2884
XFAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDWriterWithRealTrajectory::test_write_with_drivers[core] - occasional PermissionError on windows
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_frame_collect_all_same - reason: memoryreader allows independent coordinates
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice0] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice1] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice2] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/core/test_topologyattrs.py::TestResids::test_set_atoms
XFAIL MDAnalysisTests/lib/test_util.py::test_which - util.which does not get right binary on Windows
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#N] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-N=[N+]=[N-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[O+]=C] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#[C-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/coordinates/test_dcd.py::TestDCDReader::test_frame_collect_all_same - reason: DCDReader allows independent coordinates.This behaviour is deprecated and will be changedin 3.

@orbeckst (Member):

In principle, tests should pass everywhere.

The Azure tests time out in the test

_________________________ Test_Fetch_Pdb.test_timeout _________________________

which looks like something you added. I haven't looked at your code, but it might simply be that some things need to be written differently for Windows.

@orbeckst (Member):

@jauy123 do you have time to pick up this PR again? Would be great to have the feature in 2.10!

@jauy123 (Contributor, Author) commented Jun 12, 2025

I have time again. I was busy from the end of spring break onward with comps, classes, and you know what.

@jauy123 (Contributor, Author) commented Oct 21, 2025

@IAlibay @BradyAJohnston @orbeckst

I added the tests, and GitHub's machines are now reporting 100% coverage. I also filled out the changelog again. I believe this PR should be okay to merge into the dev branch.

@p-j-smith (Member) left a comment:

Nice work @jauy123! I've made some non-blocking suggestions. The only blocking one is to explicitly document all supported formats for `file_format` in the docstring.

Still need to add to the docstring and sphinx docs, and probably also still write tests.
@p-j-smith (Member) left a comment:

Looks good to me, thanks @jauy123!

@jauy123 jauy123 requested a review from orbeckst October 22, 2025 20:10
@orbeckst (Member) left a comment:

Looks all good in principle, but I have a few major requests:

  1. Have the output shape be the same as the input shape, i.e., str -> str and list -> list, and NOT "list of length 1" -> str, "list of length > 1" -> list.
  2. Update docs; you can accept my suggestions, for others you have to write a bit more. Look at it from a user's perspective: what do you want to/need to know?
  3. You need to follow our code style; you can use darker to format only your changes. Do not run black on all of topology/PDBParser.py, because at the moment @RMeli has this file excluded from format checking (he wants to avoid collisions with active PRs). Do run black on test_fetch_pdb.py, because it is a completely new file.

See comments for other minor things.

Looking good, you're almost there!
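The requested "output shape = input shape" convention can be sketched as follows; `fetch_one` is a stand-in for the real pooch-backed download, and the cache path it returns is made up:

```python
from typing import List, Union


def fetch_one(pdb_id: str) -> str:
    # Placeholder for the real download; returns the cached file path
    return f"/cache/{pdb_id.upper()}.pdb"


def fetch_pdb(pdb_ids: Union[str, List[str]]) -> Union[str, List[str]]:
    # str in -> str out; sequence in -> list out (even for length 1)
    single = isinstance(pdb_ids, str)
    ids = [pdb_ids] if single else list(pdb_ids)
    paths = [fetch_one(pdb_id) for pdb_id in ids]
    return paths[0] if single else paths


assert fetch_pdb("1ake") == "/cache/1AKE.pdb"      # str -> str
assert fetch_pdb(["1ake"]) == ["/cache/1AKE.pdb"]  # length-1 list -> length-1 list
```

With this shape-preserving contract, callers can loop over the result of a list input without special-casing the single-file case.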

Comment on lines 577 to 580
Download multiple PDB files and converting them to a universe:
>>> [mda.Universe(mda.fetch_pdb(PDB_ID), file_format="pdb.gz") for PDB_ID in ("1AKE", "4BWZ")]
[<Universe with 3816 atoms>, <Universe with 2824 atoms>]
Member:

Why is downloading one-by-one in a loop preferred over downloading all at once (shown above), i.e.,

universes = [mda.Universe(pdb) for pdb in mda.fetch_pdb(["1AKE", "4BWZ"], progressbar=True)]

If you want an example with multiple files, then I'd use one that shows off fetch_pdb's capability to download multiple files at once.

----------
PDB_IDS : str or sequence of str
A single PDB ID as a string, or a sequence of PDB IDs to fetch.
cache_path : str or pathlib.Path
Member:

Explain what the default None does.

Member:

Reading the code I see that it enables a default cache. Need to explain that and how to manage the default cache. Possibly link to relevant pooch docs.

Users would want to know where the data are saved, how they can clean them up, etc.

Returns
-------
str or list of str
The path(s) to the downloaded file(s). Returns a single string if one PDB ID is given,
Member:

What happens when I ask for a list with a single ID?

fetch_pdb(["1ake"])

I hope it returns a list of length 1 as this will make it so much easier to write code that works seamlessly with any number of inputs.

Notes
-----
This function downloads using the API established here at https://www.rcsb.org/docs/programmatic-access/file-download-services.
Member:

Do not use "here" for links — that's really bad for accessibility. Instead integrate into language:

Suggested change
This function downloads using the API established here at https://www.rcsb.org/docs/programmatic-access/file-download-services.
This function uses the `RCSB File Download Services`_ for directly downloading
structure files via https.
.. _`RCSB File Download Services`:
https://www.rcsb.org/docs/programmatic-access/file-download-services

I wouldn't call it an API (the RCSB Web API is a real programmatic API).

Also, keep the text within 79 characters for easier readability.

-----
This function downloads using the API established here at https://www.rcsb.org/docs/programmatic-access/file-download-services.
Currently supported file formats (per the API) are ('cif', 'cif.gz', 'bcif', 'bcif.gz', 'xml', 'xml.gz', 'pdb', 'pdb.gz', 'pdb1', 'pdb1.gz')
Member:

Be clear that not all of these formats can be read:

Suggested change
Currently supported file formats (per the API) are ('cif', 'cif.gz', 'bcif', 'bcif.gz', 'xml', 'xml.gz', 'pdb', 'pdb.gz', 'pdb1', 'pdb1.gz')
The RCSB currently provides data in the 'cif', 'cif.gz', 'bcif', 'bcif.gz', 'xml',
'xml.gz', 'pdb', 'pdb.gz', 'pdb1', and 'pdb1.gz' file formats, all of which can
therefore be downloaded. Not all of these formats can currently be read with MDAnalysis.

if isinstance(PDB_IDS, str):
    PDB_IDS = (PDB_IDS,)

if cache_path is None:
Member:

document what happens for None

  • Where is the cache located? Perhaps link to pooch docs
  • How do I clear the cache?
  • How do I force re-fetching?

cache_path : str or pathlib.Path
Directory where downloaded file(s) will be cached.
file_format : str
The file extension/format to download (e.g., "cif", "pdb")
Member:

Need to document allowed values.

Suggested change
The file extension/format to download (e.g., "cif", "pdb")
The file extension/format to download (e.g., "cif", "pdb").
See the Notes section below for a list of all supported file formats.

Comment on lines 634 to 640
if len(PDB_IDS) == 1:
    return str(
        downloader.fetch(
            fname=tuple(registry_dictionary.keys())[0],
            progressbar=progressbar,
        )
    )
Member:

I dislike this approach where a length 1 list always returns a string. It makes it very frustrating to write code that can process arbitrary lists, because I have to check: if I gave a single file I get back a string, but if I gave multiple ones, a list. It's MUCH SANER if "output shape = input shape".

  • If input is a str, make output a str.
  • If input is a sequence, make output a sequence.

Comment on lines 24 to 26
* Added new function `PDBParser.fetch_pdb` to download structure files from wwPDB
using `pooch` as optional dependency. This can be used to interactively create
`Universe` classes on the fly. (Issue #4907, PR #4943)
Member:

indent properly

Member:

and shorten

Suggested change
* Added new function `PDBParser.fetch_pdb` to download structure files from wwPDB
using `pooch` as optional dependency. This can be used to interactively create
`Universe` classes on the fly. (Issue #4907, PR #4943)
* Added function `topology.PDBParser.fetch_pdb` (accessible as
`MDAnalysis.fetch_pdb()`) to download structure files from wwPDB using
`pooch` as optional dependency (Issue #4907, PR #4943)

If you use a length 1 iterator for pdb_id, you get back a rank 1 list
Docstring changed
Shortened really long comment in fetch_pdb
Comment on lines 651 to 663
if type(pdb_ids) is str:
    return str(
        downloader.fetch(
            fname=tuple(registry_dictionary.keys())[0],
            progressbar=progressbar,
        )
    )
else:
    return [
        str(downloader.fetch(fname=file_name, progressbar=progressbar))
        for file_name in registry_dictionary.keys()
    ]
@orbeckst (Member) commented Oct 22, 2025:

Is there a way to simplify the code so that you only downloader.fetch(fname=...) once? Along the lines of

paths = [
    downloader.fetch(fname=file_name, progressbar=progressbar)
    for file_name in registry_dictionary.keys()
]
if type(pdb_ids) is str:
    return paths[0]
return paths

or even more concise with the Python ternary operator:

paths = [
    downloader.fetch(fname=file_name, progressbar=progressbar)
    for file_name in registry_dictionary.keys()
]
return paths if type(pdb_ids) is not str else paths[0]

class TestDocstringExamples:
    """This class tests all the examples found in fetch_pdb's docstring"""

    @pytest.mark.parametrize("pdb_id", [("1AKE"), ("4BWZ")])
@orbeckst (Member) commented Oct 22, 2025:

Also test a list with a single entry ["1AKE"]. Check that it returns a list of paths.

list_of_path_strings = mda.fetch_pdb(
    ["1AKE", "4BWZ"], cache_path=tmp_path, progressbar=True
)
assert all(isinstance(PDB_ID, str) for PDB_ID in list_of_path_strings)
Member:

also test the actual names, not just the types

    reason="Pooch is installed.",
)
def test_pooch_installation(tmp_path):
    with pytest.raises(ModuleNotFoundError):
Member:

also check the error message of the exception; there's an additional argument, like

with pytest.raises(ModuleNotFoundError, match="pooch is needed as a dependency for fetch_pdb()"):

)
@pytest.mark.parametrize("pdb_id", [("1AKE"), ("4BWZ")])
def test_no_cache_path(pdb_id):
    assert isinstance(mda.fetch_pdb(pdb_id, cache_path=None), str)
Member:

It would be good practice to clean up the default cache afterwards, because if I run this test myself the cache persists somewhere on my system. Tests should not leave anything behind.

paths = [
    downloader.fetch(fname=file_name, progressbar=progressbar)
    for file_name in registry_dictionary.keys()
]

return paths if type(pdb_ids) is not str else paths[0]
Member:

`not isinstance(pdb_ids, str)`, as you used above, is probably safer than using `type`.
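A quick illustration of the difference: an exact `type(...) is str` check rejects subclasses of `str`, while `isinstance` accepts them. `PdbId` here is a made-up subclass for demonstration only:

```python
class PdbId(str):
    """Hypothetical str subclass, e.g. one that validates PDB IDs."""


pdb_id = PdbId("1ake")

assert type(pdb_id) is not str   # exact-type check fails for a subclass
assert isinstance(pdb_id, str)   # isinstance still treats it as a string
```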


if HAS_POOCH:
    from requests.exceptions import HTTPError
    from pooch import os_cache
Member:

I'd avoid importing names into the namespace. When reading the code below, it's not clear where os_cache comes from. It's much clearer to just import pooch and then later use pooch.os_cache.

class TestDocstringExamples:
    """This class tests all the examples found in fetch_pdb's docstring"""

    @pytest.mark.parametrize("pdb_id", [("1AKE"), ("4BWZ")])
Member:

The parentheses don't do anything so remove:

@pytest.mark.parametrize("pdb_id", ["1AKE", "4BWZ"])

mda.fetch_pdb(
    pdb_ids="1AKE", cache_path=tmp_path, file_format="barfoo"
)
@pytest.fixture()
Member:

Very nice to create it as a fixture!

@pytest.fixture()
def clean_up_default_cache():
    yield
    rmtree(os_cache("MDAnalysis_pdbs"))
Member:

For this to work, one must know that "MDAnalysis_pdbs" is hard-coded into fetch_pdb. It would be safer to define the name in PDBParser.py as

DEFAULT_POOCH_CACHE_NAME = "MDAnalysis_pdbs"

and then use DEFAULT_POOCH_CACHE_NAME everywhere else. In particular, here you can then use

import MDAnalysis.topology.PDBParser
...
def clean_up_default_cache():
    yield
    rmtree(os_cache(MDAnalysis.topology.PDBParser.DEFAULT_POOCH_CACHE_NAME))

More typing initially, but cleaner and more robust, because the hard-coded name exists in exactly one location.

from MDAnalysis.topology.PDBParser import HAS_POOCH

from urllib import request
from shutil import rmtree
Member:

Personally, I'd import the packages instead of the functions.

(However, these are all standard library packages so people may know where rmtree etc come from so I'll leave it to you.)

class TestExpectedErrors:

    def test_invalid_pdb(self, tmp_path):
        with pytest.raises(HTTPError):
Member:

match="..." error output (when you test an exception you want to make sure it is "your" exception)

Contributor (Author):

In this particular case, it wouldn't be my exception. It's requests.exceptions.HTTPError, from requests, a third-party dependency of pooch. I don't think I need to match in that case?

mda.fetch_pdb(pdb_ids="foobar", cache_path=tmp_path)

def test_invalid_file_format(self, tmp_path):
    with pytest.raises(ValueError):
Member:

match="..." error output (when you test an exception you want to make sure it is "your" exception)



Development

Successfully merging this pull request may close these issues.

mda.fetch_pdb() to generate Universe from Protein Databank structures

6 participants