DataLad for multi-echo data access #13
The problem with the existing OpenNeuro datasets is that most don't have the echo-wise preprocessed data we need for our examples. We thought of just fMRIPrepping the open datasets ourselves and uploading the derivatives to OpenNeuro in separate "datasets" linking to the original ones, but OpenNeuro doesn't currently support uploading derivatives-only datasets (see OpenNeuroOrg/openneuro#2436), so I don't know if we can directly use OpenNeuro for most of our planned examples. Currently, we're looking at uploading fMRIPrep derivatives to the OSF and using a fetcher to grab them from there. Is there a storage alternative that would be more compatible with DataLad?
Chris actually mentioned G-Node in that issue, which I had forgotten. Would that be a good alternative? I think we looked at it but decided against it for tedana's
Yup, GIN is a good option for public and free hosting of data (up to a number of terabytes per account/repo, iirc). And it works well with standard DataLad functionality. See here for a walkthrough of how to publish/connect a DataLad dataset to GIN: https://handbook.datalad.org/en/latest/basics/101-139-gin.html

DataLad also has an extension for integrating with OSF, http://docs.datalad.org/projects/osf/en/latest/, so that's also a possibility. I guess it depends on which dependencies are fine to include (if any at all) for which packages.

Either way, if DataLad is an alternative for getting data used in the book, I can see the superdataset having a structure like this:
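For example, something along these lines (a hypothetical layout for illustration; the actual dataset names would differ):

multi-echo-super/
├── raw/
│   ├── openneuro-dataset-1/    (subdataset hosted on OpenNeuro)
│   └── openneuro-dataset-2/
└── derivatives/
    ├── dataset-1-fmriprep/     (subdataset hosted on e.g. GIN or OSF)
    └── dataset-2-fmriprep/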
where all raw or derivative datasets would essentially be git submodules that link to the respective datasets, which are in turn hosted either on OpenNeuro (i.e. the raw datasets) or, for example, on GIN (i.e. the derivative datasets). Having all of these structured as a hierarchy of nested datalad datasets makes it very easy for datalad to give users access to any specific (sub)datasets and/or files.
Here's v1 of the super-dataset, currently containing only raw subdatasets that are hosted on OpenNeuro: https://github.com/jsheunis/multi-echo-super
The multi-echo-super dataset now has all open multi-echo datasets from OpenNeuro included (as far as I'm aware), and also the fMRIPrep-processed data of the Multi-echo Cambridge dataset that's on OSF (see this comment). @notZaki, did you use the OSF API to get file paths and URLs in order to build the manifest file?
@jsheunis Here's a link to the manifest file for the masking test dataset: I made this Julia package to make the JSON file. There is an example in the readme on how to produce such files. Alternatively, the osfclient package for Python might also be able to do something similar, but I haven't used it.
Oh, that's perfect, thanks @notZaki!
And thanks for the pointers to your Julia package and osfclient 👍
@emdupre has also made CSV files for fetching data, but I don't remember how that was done.
I had just grabbed them with Python requests; here's a short gist demonstrating the idea. That really works best for flat directory structures, but for more nested ones you'll have to add another loop! At some point I tried
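The gist boils down to something like the following (a minimal sketch, not the actual gist: the OSF project ID is a placeholder, a flat osfstorage layout is assumed, and the response fields are those of the public OSF v2 API as I recall them):

import os
import requests

PROJECT = "abc12"  # placeholder OSF project ID
API_URL = f"https://api.osf.io/v2/nodes/{PROJECT}/files/osfstorage/"

out_dir = "data"
os.makedirs(out_dir, exist_ok=True)

# List the top-level files of the project's osfstorage and download each one.
# A nested folder structure would need an extra loop over folder entries.
listing = requests.get(API_URL).json()
for entry in listing["data"]:
    if entry["attributes"]["kind"] != "file":
        continue
    name = entry["attributes"]["name"]
    download_url = entry["links"]["download"]
    response = requests.get(download_url)
    response.raise_for_status()
    with open(os.path.join(out_dir, name), "wb") as f:
        f.write(response.content)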
Thanks! I'll update here in case I try the recent
Is there a good way to use the datalad Python tool or repo2data to grab only a single folder from a G-Node GIN or datalad dataset? I think installing the whole dataset would take too long in some cases (e.g., with the Cambridge and Le Petit Prince fMRIPrep derivatives).
@tsalo Just to be sure we're talking about the same things, with "grab only a single folder" do you refer to retrieving actual file content, or just getting the file tree (from git)? And with "installing a whole dataset" do you mean install in the datalad sense (where the git repo is cloned, but file content is not (yet) retrieved), or do you mean retrieving all data locally?

With datalad you can clone (a.k.a. install) the whole dataset easily, e.g.:

$ datalad clone https://github.com/jsheunis/multi-echo-cambridge-fmriprep.git

This clones the dataset's git repo and some datalad config files, but no file content. It takes a few seconds. And then you can get (and drop) specific file content on demand, e.g. all files within a directory at a specified relative path:

$ cd multi-echo-cambridge-fmriprep
$ datalad get sub-20847/figures/*
get(ok): sub-20847/figures/sub-20847_task-rest_desc-rois_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-carpetplot_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-summary_T1w.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_space-MNI152NLin2009cAsym_T1w.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-summary_bold.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-confoundcorr_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-conform_T1w.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-validation_bold.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-compcorvar_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-about_T1w.html (file) [from web...]
[2 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
get (ok: 12)
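For completeness, since the datalad Python tool came up: the same two steps can be done from Python with the datalad.api module (a quick sketch mirroring the CLI calls above):

import datalad.api as dl

# Clone the dataset: fetches the git repo and datalad config, but no annexed file content
ds = dl.clone(source="https://github.com/jsheunis/multi-echo-cambridge-fmriprep.git",
              path="multi-echo-cambridge-fmriprep")

# Retrieve the content of one subject's figures on demand
ds.get("sub-20847/figures/")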
Sorry for the confusion.
I'm referring to just getting the file tree.
I'm referring to installing in the datalad sense.
My concern is that installing the dataset still downloads a lot of content. Maybe more is indexed with git (vs. git-annex) on G-Node GIN by default, but it seemed like most non-NIfTI files were downloaded in the initial clone.
Thanks for clarifying, and for the link to the repo. It looks like the dataset has too many files in git vs git-annex. If you used datalad to create the dataset, the way you can control this is via configurations: https://handbook.datalad.org/en/latest/basics/101-122-config.html

A way you can amend the dataset such that files are moved from git to git-annex (and removed from the git history) is described here: http://handbook.datalad.org/en/latest/beyond_basics/101-162-springcleaning.html#getting-contents-out-of-git. It involves:
This handbook chapter also describes other ways to keep dataset size small, e.g. using subdatasets per subject: http://handbook.datalad.org/en/latest/beyond_basics/101-161-biganalyses.html#calculate-in-greater-numbers
Ohhhh thanks! I'll try modifying the dataset. That will make using it way easier! Do you have a recommendation for downloading the data for this book? Should we use datalad to clone the dataset and install one subject's data in a separate script (e.g., the download_data chapter), or can we use repo2data for this?
Do you mean when downloading data for the book during the building process? I would say datalad is a good option, yes, if we do have all datasets available as datalad datasets (that was what I intended when creating this issue), and if the infrastructure that we're running the building process or the notebooks on has the requirements for datalad installed. I see there's a GitHub Actions workflow using Ubuntu to build the book, so it will be easy to add steps for installing git-annex and datalad.

It looks like all the publicly available datasets listed in the book are already included in the multi-echo-super dataset here: https://github.com/jsheunis/multi-echo-super/tree/main/raw, and the derivatives are added as they are made available, so I think datalad should work. The way to access individual subjects' files of specific datasets would then be:
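Something along these lines with the datalad Python API (a sketch; the subdataset directory and subject ID are placeholders for whatever a given chapter needs):

import datalad.api as dl

# Clone the superdataset: lightweight, no annexed file content is fetched yet
superds = dl.clone(source="https://github.com/jsheunis/multi-echo-super.git",
                   path="multi-echo-super")

# Placeholders: pick the actual subdataset directory (see raw/ on GitHub) and subject
subdataset = "raw/some-openneuro-dataset"
subject = "sub-01"

# Install the subdataset's git repo without downloading its data ...
superds.get(subdataset, get_data=False)

# ... and then retrieve only that subject's files
superds.get(f"{subdataset}/{subject}")

On the command line the equivalent would be datalad clone followed by datalad get -n for the subdataset and datalad get for the subject's files.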
What do you think about using DataLad to streamline data access for publicly available ME datasets? It looks like all of the datasets used in the book that don't require a data use agreement are on OpenNeuro, i.e. they are already DataLad datasets. It will be easy to include those as subdatasets into a multi-echo "super dataset" that people can clone and then download individual subdatasets or files selectively.
Of course, we don't have to make DataLad a requirement for people working with the book's tutorials, so this could also just be an alternative for those who have datalad installed.
Additionally, if some tutorials can be run on Binder, we have this ready-made config for running datalad on binder: https://github.com/datalad/datalad-binder