Refactor datasets.py and update corresponding unit tests/tutorial notebook #89

Merged May 17, 2022
39 commits
85d156a
Issue #74: Remove OSFUpload class, change Project class name to OSFPr…
bongjinkoo May 10, 2022
3b8aa92
Issue #74: Add osfclient to environment.yml.
bongjinkoo May 10, 2022
80866de
Issue #74: Update black version in .pre-commit-config.yaml.
bongjinkoo May 10, 2022
39344f7
Issue #74: Add a delete file function in datasets.py. Add method args…
bongjinkoo May 13, 2022
f7e9e8f
Issue #74: Update errors from ValueError to TypeError in datasets.py
bongjinkoo May 13, 2022
5027232
Issue #74: Add a test_upload file and finish up datasets.py, test_dat…
bongjinkoo May 14, 2022
8a7dbb3
Issue #74: Update download_and_upload_with_osf.ipynb.
bongjinkoo May 15, 2022
017d75d
Issue #74: Fix a print string error in test_datasets.py.
bongjinkoo May 15, 2022
9035d6c
Issue #74: Remove an unnecessary comment in datasets.py.
bongjinkoo May 15, 2022
f0898a8
Issue #74: Update environment.yml for osfclient.
bongjinkoo May 15, 2022
4be65f0
Issue #74: Testing unit tests for osfclient.
bongjinkoo May 15, 2022
7f2d9ed
Issue #74: Testing test_datasets.py. Using subprocess instead of os.s…
bongjinkoo May 16, 2022
3cd92d2
Issue #74: Testing test_datasets.py, using subprocess.run().
bongjinkoo May 16, 2022
9583821
Issue #74: Testing test_datasets.py, add logging.
bongjinkoo May 16, 2022
32b3ae7
Issue #74: Testing test_datasets.py, update logging.
bongjinkoo May 17, 2022
21d3d16
Issue #74: Testing test_datasets.py, update logging.
bongjinkoo May 17, 2022
a0035dd
Issue #74: Testing test_datasets.py, trying a wrong command.
bongjinkoo May 17, 2022
753b5fa
Issue #74: Testing test_datasets.py.
bongjinkoo May 17, 2022
5486fb2
Issue #74: Testing test_datasets.py.
bongjinkoo May 17, 2022
73c2852
Issue #74: Testing test_datasets.py.
bongjinkoo May 17, 2022
bf15165
Issue #74: Testing test_datasets.py, using subprocess.check_output() …
bongjinkoo May 17, 2022
8951bb4
Issue #74: Testing test_datasets.py, using subprocess.run() with stdo…
bongjinkoo May 17, 2022
b334dfa
Issue #74: Testing test_datasets.py, outputting CONDA path.
bongjinkoo May 17, 2022
f521049
Issue #74: Testing test_datasets.py, prefixing CONDA path.
bongjinkoo May 17, 2022
d834324
Issue #74: Testing test_datasets.py, prefixing CONDA path.
bongjinkoo May 17, 2022
6d274c1
Issue #74: Testing test_datasets.py, fix test file path.
bongjinkoo May 17, 2022
a5d1045
Issue #74: Testing test_datasets.py, fix test file path.
bongjinkoo May 17, 2022
5852042
Issue #74: Testing test_datasets.py, fix test file path.
bongjinkoo May 17, 2022
3aab671
Issue #74: Testing test_datasets.py, fix test file path.
bongjinkoo May 17, 2022
603fd4e
Issue #74: Testing test_datasets.py, fix test file path.
bongjinkoo May 17, 2022
da55acb
Issue #74: Testing test_datasets.py, fix test file path.
bongjinkoo May 17, 2022
d3f3c29
Issue #74: Testing test_datasets.py, fix test file path.
bongjinkoo May 17, 2022
d550e4e
Issue #74: Testing test_datasets.py, adding --force for osf upload.
bongjinkoo May 17, 2022
7d6356a
Issue #74: Testing test_datasets.py, randomising the test file name a…
bongjinkoo May 17, 2022
61e6fbc
Issue #74: Update test_datasets.py.
bongjinkoo May 17, 2022
9039ec6
Issue #74: Update coverage for test_datasets.py.
bongjinkoo May 17, 2022
d0a5342
Issue #74: Remove os.system() in test_datasets.py.
bongjinkoo May 17, 2022
1ea2739
Issue #74: Add check flag for subprocess in test_datasets.py.
bongjinkoo May 17, 2022
355146b
Issue #74: Fix a long line in datasets.py.
bongjinkoo May 17, 2022
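
The later commits in this list move the tests from os.system() to subprocess.run(), with captured stdout and a check flag. A minimal sketch of why that helps in a test, assuming the Python interpreter as a stand-in command (the PR itself runs the `osf` CLI, and this is not the PR's test code): os.system() only hands back an exit status, while subprocess.run() can return the output as a string and raise on failure.

```python
import subprocess
import sys

# Stand-in command (hypothetical); the PR runs the `osf` CLI instead.
command = [sys.executable, "-c", "print('file_a.txt')"]

# stdout=PIPE captures the output, text=True decodes it to str.
result = subprocess.run(command, text=True, check=True, stdout=subprocess.PIPE)
listing = result.stdout  # a test can now assert on this string

# check=True turns a non-zero exit status into CalledProcessError.
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
try:
    subprocess.run(failing, check=True)
    failed = False
except subprocess.CalledProcessError as error:
    failed = True
    exit_code = error.returncode
```

With os.system() a failing command would pass silently unless the caller inspected the returned status by hand.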
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -19,7 +19,7 @@ repos:
- --maxkb=2048
- id: trailing-whitespace
- repo: https://github.com/psf/black
rev: 21.8b0
rev: 22.3.0
hooks:
- id: black
- repo: https://github.com/pycqa/isort
1 change: 1 addition & 0 deletions environment.yml
@@ -17,6 +17,7 @@ dependencies:
- pytest-cov
- pytorch
- pyyaml
- osfclient
- pip :
- git+https://github.com/compSPI/simSPI.git
- starfile
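
The osfclient package provides the `osf` command-line tool that datasets.py invokes. Several test commits in this PR adjust a CONDA path prefix, which suggests the executable is not always on PATH. One way to check for it before running a command is sketched below; `resolve_executable` is a hypothetical helper, not part of this PR, and the example looks up the Python interpreter only so that it runs anywhere.

```python
import os
import shutil
import sys
from typing import Optional


def resolve_executable(name: str, directory: Optional[str] = None) -> Optional[str]:
    """Return the full path of an executable, or None if it cannot be found."""
    if directory is not None:
        # Search only the given directory, e.g. an environment's bin folder.
        return shutil.which(name, path=directory)
    # Otherwise search the directories listed in PATH.
    return shutil.which(name)


# Usage: look up the running interpreter in its own directory.
python_dir, python_name = os.path.split(sys.executable)
found = resolve_executable(python_name, python_dir)

# A name that should not exist resolves to None instead of failing later.
missing = resolve_executable("no-such-command-for-this-example")
```

In the PR's setting the call would be `resolve_executable("osf", osfclient_path)`, under the assumption that `osfclient_path` names a directory.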
294 changes: 103 additions & 191 deletions ioSPI/datasets.py
@@ -1,36 +1,58 @@
"""Module to house methods related to datasets (micrographs, meta-data, etc.)."""

import io
import os
import typing
from pathlib import Path
import subprocess

import requests

class OSFProject:
"""Class to list, download and upload data in an OSF project.

class Project:
"""Class to list, download and upload data from OSF.
It uses osfclient library.

Parameters
----------
username : str
username : str, default = None
Username corresponding to an account on OSF.
E.g. email address used to create an OSF account.
token : str
Personal token from OSF.io.
token : str, default = None
Personal token from osf.io.
See: https://osf.io/settings/tokens
project_id : str, default = "7g42j"
project_id : str, default = "xbr2m"
Identifier of the project, found on the OSF project page.
E.g. 7g42j for project at https://osf.io/7g42j/
E.g. xbr2m for project at https://osf.io/xbr2m/
storage : str, default = "osfstorage"
Storage provider of the project.
osfclient_path : str, default = None

See Also
--------
osfclient: https://github.com/osfclient/osfclient
OSF API documentation : https://developer.osf.io/
"""

def __init__(self, username: str, token: str, project_id: str = "7g42j") -> None:
def __init__(
self,
username: str = None,
token: str = None,
project_id: str = "xbr2m",
storage: str = "osfstorage",
osfclient_path: str = None,
) -> None:
if username is None:
raise TypeError("username must be provided.")
self.username = username

if token is None:
raise TypeError("token must be provided.")
self.token = token

self.project_id = project_id
self.storage = storage
self.osfclient_path = osfclient_path
self.osfclient_command = "osf "
if osfclient_path is not None:
self.osfclient_command = self.osfclient_path + self.osfclient_command

config_path = os.path.join(".osfcli.config")
with open(config_path, "w") as out_file:
@@ -43,213 +65,103 @@ def __init__(self, username: str, token: str, project_id: str = "7g42j") -> None
def ls(self):
"""List all files in the project."""
print(f"Listing files from OSF project: {self.project_id}...")
os.system("osf ls")
return subprocess.run(
self.osfclient_command + "ls",
shell=True,
text=True,
check=True,
stdout=subprocess.PIPE,
).stdout

@staticmethod
def download(remote_path, local_path):
"""Download file from osf and save it locally.
def download(self, remote_path: str = None, local_path: str = None):
"""Download a file from an OSF project and save it locally.

Parameters
----------
remote_path : str
Remote path of the file on OSF.
remote_path : str, default = None
Remote path of the file in an OSF project,
which will be appended to the project storage name (by default, osfstorage).
E.g. osfstorage/
randomrot1D_nodisorder/
4v6x_randomrot_copy6_defocus3.0_yes_noise.txt
local_path : str
local_path : str, default = None
Local path where the file will be saved.
E.g. 4v6x_randomrot_copy6_defocus3.0_yes_noise.txt
"""
print(f"Downloading {remote_path} to {local_path}...")
os.system(f"osf fetch {remote_path} {local_path}")
if remote_path is None:
raise TypeError("remote_path must be provided.")
if local_path is None:
raise TypeError("local_path must be provided.")

full_remote_path = self.storage + "/" + remote_path
print(f"Downloading {full_remote_path} to {local_path}...")
subprocess.run(
self.osfclient_command + f"fetch {full_remote_path} {local_path}",
shell=True,
text=True,
check=True,
stdout=subprocess.PIPE,
)
print("Done!")

@staticmethod
def upload(remote_path, local_path):
"""Upload file to osf.
def upload(self, local_path: str = None, remote_path: str = None):
"""Upload a file to an OSF project.

Notes
-----
You should have requested permission to upload to the project first.

Parameters
----------
remote_path : str
Remote path of the file on OSF.
local_path : str, default = None
Local path of the file to upload.
E.g. 4v6x_randomrot_copy6_defocus3.0_yes_noise.txt
remote_path : str, default = None
Remote path of the file in an OSF project,
which will be appended to the project storage name (by default, osfstorage).
E.g. osfstorage/
randomrot1D_nodisorder/
4v6x_randomrot_copy6_defocus3.0_yes_noise.txt
local_path : str
Local path where the file will be saved.
E.g. 4v6x_randomrot_copy6_defocus3.0_yes_noise.txt
"""
print(f"Uploading {local_path} to {remote_path}...")
os.system(f"osf upload {local_path} {remote_path}")
if local_path is None:
raise TypeError("local_path must be provided.")
if remote_path is None:
raise TypeError("remote_path must be provided.")

full_remote_path = self.storage + "/" + remote_path
print(f"Uploading {local_path} to {full_remote_path}...")
f = subprocess.run(
self.osfclient_command + f"upload {local_path} " f"{full_remote_path}",
shell=True,
text=True,
check=True,
stdout=subprocess.PIPE,
).stdout
print(io.StringIO(f).readlines())
print("Done!")


class OSFUpload:
"""Class to upload datasets to OSF.io.

Parameters
----------
token : str
Personal token from OSF.io with access to dataset (e.g. cryoEM, etc).
data_node_guid : str, default = "24htr"
OSF GUID of data node that houses dataset.

Attributes
----------
headers : dict of type str:str
Headers containing authorisation token for requests.
base_url : str
OSF.io API url base.
data_node_guid : str
OSF GUID of data node that houses dataset.

See Also
--------
OSF API documentation : https://developer.osf.io/
"""

def __init__(self, token: str, data_node_guid: str = "24htr") -> None:

self.headers = {"Authorization": f"Bearer {token}"}
self.base_url = "https://api.osf.io/v2/"

requests.get(self.base_url, headers=self.headers).raise_for_status()

self.data_node_guid = data_node_guid

def read_structure_guid(self, structure_label: str) -> str:
"""Return GUID of OSF node for structures with given label.

If no existing node is found, returns none.

def remove(self, remote_path: str = None):
"""Remove a file in an OSF project.

Parameters
----------
structure_label:str
Structure ID from PDB or EMDB used for generating data.

Returns
-------
GUID of structure node on OSF.io

See Also
--------
Protein Data Bank(PDB) : https://www.rcsb.org/
EM Data Resource(EMDB) : https://www.emdataresource.org/
"""
existing_structures = self.read_existing_structure_labels()
if structure_label not in existing_structures:
return None
return existing_structures[structure_label]

def write_child_node(
self, parent_guid: str, title: str, tags: typing.Optional[str] = None
) -> str:
"""Write a new child node in OSF.io.

Parameters
----------
parent_guid:str
GUID of parent node.
title:str
Title of child node.
tags: list[str], optional
Tags of child node.

Returns
-------
str
GUID of newly created child node.

Raises
------
HTTPError
Raised if POST request to OSF.io fails.
remote_path : str, default = None
Remote path of the file to remove in an OSF project,
which will be appended to the project storage name (by default, osfstorage).
E.g. osfstorage/
randomrot1D_nodisorder/
4v6x_randomrot_copy6_defocus3.0_yes_noise.txt
"""
request_url = f"{self.base_url}nodes/{parent_guid}/children/"

request_body = {
"type": "nodes",
"attributes": {"title": title, "category": "data", "public": True},
}

if tags is not None:
request_body["attributes"]["tags"] = tags

response = requests.post(
request_url, headers=self.headers, json={"data": request_body}
if remote_path is None:
raise TypeError("remote_path must be provided.")

full_remote_path = self.storage + "/" + remote_path
print(f"Removing {full_remote_path} in the project...")
subprocess.run(
self.osfclient_command + f"remove {full_remote_path}",
shell=True,
text=True,
check=True,
stdout=subprocess.PIPE,
)
response.raise_for_status()
return response.json()["data"]["id"]

def read_existing_structure_labels(self) -> typing.Dict[str, str]:
"""Get labels and GUIDs of structural nodes in OSF dataset.

Returns
-------
dict of type str : str
Returns dictionary of node labels mapped to node GUIDs.

Raises
------
HTTPError
Raised if GET request to OSF.io fails.
"""
request_url = f"{self.base_url}nodes/{self.data_node_guid}/children/"
response = requests.get(request_url, headers=self.headers)
response.raise_for_status()
dataset_node_children = response.json()["data"]

existing_structures = {
child["attributes"]["title"]: child["id"] for child in dataset_node_children
}

return existing_structures

def write_files(self, dataset_guid: str, file_paths: typing.List[str]):
"""Post files to a node in OSF.io.

Parameters
----------
dataset_guid : str
GUID of node where file is to be uploaded.
file_paths : list[str]
File paths of files to be uploaded.

Returns
-------
bool
True if all uploads are successful, false otherwise.
"""
files_base_url = "http://files.ca-1.osf.io/v1/resources/"
create_request_url = f"{files_base_url}{dataset_guid}/providers/osfstorage/"
success = True

for file_path_string in file_paths:
file_path = Path(file_path_string)

query_parameters = f"?kind=file&name={file_path.name}"
response = requests.put(
create_request_url + query_parameters, headers=self.headers
)
response.raise_for_status()

data_upload__url = response.json()["data"]["links"]["upload"]

with open(file_path, "rb") as file_content:
response = requests.put(
data_upload__url, data=file_content, headers=self.headers
)
response.raise_for_status()

if not response.ok:
print(f"Upload {file_path} failed with code {response.status_code}")
success = False
else:
print(f"Uploaded {file_path} ")

return success
print("Done!")
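
The methods above share one pattern: reject a missing argument with TypeError, prefix the remote path with the storage name, and concatenate a command string that is run with shell=True. String concatenation breaks on paths containing spaces, and `osfclient_path + "osf "` only works when the path ends with a separator. A possible hardening, which is not what this PR does, is to build an argument list instead; `full_remote_path` and `build_osf_command` below are hypothetical helpers written for illustration.

```python
import os
from typing import List, Optional


def full_remote_path(remote_path: Optional[str], storage: str = "osfstorage") -> str:
    """Prefix a project-relative path with the storage provider name."""
    if remote_path is None:
        # Same convention as the PR: a missing argument raises TypeError.
        raise TypeError("remote_path must be provided.")
    return storage + "/" + remote_path


def build_osf_command(
    subcommand: str, *args: str, osfclient_path: Optional[str] = None
) -> List[str]:
    """Return an argument list suitable for subprocess.run without a shell."""
    executable = "osf"
    if osfclient_path is not None:
        # os.path.join adds the separator if the directory lacks one.
        executable = os.path.join(osfclient_path, "osf")
    return [executable, subcommand, *args]


# The file name contains a space and still stays a single argument.
cmd = build_osf_command(
    "fetch",
    full_remote_path("randomrot1D_nodisorder/my file.txt"),
    "my file.txt",
)
```

The resulting list can be passed to `subprocess.run(cmd, text=True, check=True, stdout=subprocess.PIPE)` with no `shell=True`. Note that the caller passes the path relative to the storage root; the `osfstorage/` prefix is added by the helper, as it is by the merged methods.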