Inconsistent data entry #211

Open
chasemc opened this issue Feb 14, 2022 · 8 comments

Comments

@chasemc

chasemc commented Feb 14, 2022

Is your feature request related to a problem? Please describe.
I've started trying to use the platform programmatically (via the JSON files) but have had issues with the first two datasets I tried. Maybe it's just these two, but I was wondering whether entries are, or could be, validated upon submission? Specifically, I've encountered GenBank_accession and BioSample_accession fields filled with different types of data/accessions, which has made it difficult to reliably discover and download the genetic data programmatically (I'm currently using Entrez, but even so it's not finding all the data).

Some examples from the first couple of paired datasets I've looked at:

Describe the solution you'd like

I'm not sure "GenBank accession number" and "RefSeq accession number" will be clear to everyone if they mean the "Assembly" accession; if so, it may be beneficial to add 'assembly', e.g. "RefSeq assembly accession".
Ideally accessions would be checked (e.g. via Entrez); otherwise, could at least valid prefixes be checked? (1, 2)
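
As a rough illustration of the prefix check suggested above, here is a minimal Python sketch. The patterns are simplified assumptions based on the common NCBI/INSDC accession shapes (`GCA_`/`GCF_` for assemblies, `SAMN`/`SAMEA`/`SAMD` for BioSamples); they are not taken from the platform's schema, and the function name `classify_accession` is hypothetical.

```python
import re
from typing import Optional

# Simplified, assumed patterns for illustration only; not the platform's schema.
ACCESSION_PATTERNS = {
    "genbank_assembly": re.compile(r"GCA_\d{9}\.\d+"),
    "refseq_assembly": re.compile(r"GCF_\d{9}\.\d+"),
    "biosample": re.compile(r"SAM[NED][A-Z]?\d+"),
}


def classify_accession(value: str) -> Optional[str]:
    """Return the kind of accession `value` looks like, or None if unrecognised."""
    candidate = value.strip()
    for kind, pattern in ACCESSION_PATTERNS.items():
        # fullmatch rejects values with trailing junk or several IDs in one field
        if pattern.fullmatch(candidate):
            return kind
    return None
```

A submission form could, for example, reject a BioSample_accession value unless `classify_accession(value) == "biosample"`. This only checks the shape of the string; confirming that the record actually exists would still need a lookup such as Entrez.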

@justinjjvanderhooft

Thanks @chasemc for pointing this out. The submitted projects are currently manually reviewed; in addition, most fields have a required syntax, and a submission that does not follow it will not validate. Indeed, some genome sequence and BioSample fields seem to be "incorrect". Typically, we cross-check some examples, but since we have also encountered embargoed entries, it is hard to find a good way to check these before accepting a project into PoDP. My suggestion would be to email the project submitter and ask for clarification and an update of the project where possible (existing projects can be edited to update or correct the information). Let me know if you have any other suggestions or ideas! Happy to consider them! 😎

@chasemc
Author

chasemc commented Feb 14, 2022

If embargoed data is accepted, then I think it would be good to have an "embargoed data/open access" field, maybe with the date the embargo ends, so it is apparent which datasets other people will have access to.

Checks on non-embargoed data could be done on submission, and embargoed data could be checked after the entered embargo end date?

Otherwise it's going to be impossible to build tools that use the platform without manual metadata cleaning.
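
To make the idea concrete, here is a small Python sketch of that rule. It assumes a hypothetical optional embargo end date stored with each project (no such field exists on the platform as far as this thread shows), and the function name `should_check_accessions` is made up for illustration.

```python
from datetime import date
from typing import Optional


def should_check_accessions(embargo_end: Optional[date], today: date) -> bool:
    """Return True when accession checks can run for a project.

    `embargo_end` is a hypothetical field: None means the data is open access,
    so checks can run at submission; otherwise they wait until the embargo ends.
    """
    return embargo_end is None or today >= embargo_end
```

With a field like this, tools could also skip datasets that are still embargoed instead of failing on accessions that cannot be resolved yet.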

@chasemc
Author

chasemc commented Feb 14, 2022

Oh, I messed up those links; the second one should have been
https://pairedomicsdata.bioinformatics.nl/projects/b78b5817-86e2-4e5e-a087-a6b0d9710fce.3

@justinjjvanderhooft

Good point, we will consider that! And that link does indeed look suspicious, given its repeated GenBank accession IDs.

@justinjjvanderhooft

@chasemc the second link will be updated soon. In the meantime, all the accession IDs are available through the BioSample project. Thanks again for your interest in using the platform! 😎

@chasemc
Author

chasemc commented Feb 15, 2022

I'm still having a rough time, even with manual intervention. Can you point me to a dataset that should be really clean, with good links?

I've been working with this one since last night because it had good genomics and was fairly small, but I couldn't get the GNPS file names to link. It seems like the molecular network uses different files than the MassIVE link, while pairedomics uses the filenames in the MassIVE repo.
https://pairedomicsdata.bioinformatics.nl/projects/1b0dccac-5212-4dfd-a9f2-6fa953ab16bd.5

@justinjjvanderhooft

https://pairedomicsdata.bioinformatics.nl/projects/297c364c-b154-4edd-a7d5-68decf9effa2.4 What about this one, @chasemc? I agree that some manual intervention will remain necessary for the foreseeable future. I think the majority of these genomes can be downloaded, and the links should work. Let us know how you get on.

@chasemc
Author

chasemc commented Feb 15, 2022

Thanks, that's the one I've landed on as well. The only problem so far isn't really a problem: I've written a downloader/parser for GNPS snets v2 results but not v1; I'm currently running all that data through GNPS v2.
Fingers crossed.

2 participants