Inconsistent data entry #211
Comments
Thanks @chasemc for pointing this out. Submitted projects are currently reviewed manually; however, most fields have a required syntax, as the submission will otherwise not validate. Indeed, some genome sequence and BioSample fields seem to be "incorrect". Typically, we cross-check some examples, but after also encountering some embargoed entries, it seems hard to find a good way to check these before accepting the project into PoDP. My suggestion would be to email the project submitter and ask for clarification and for the project to be updated where possible (current projects can be edited to update or correct the information). Let me know if you have any other suggestions or ideas! Happy to consider them! 😎
If embargoed data is accepted, then I think it would be good to have an "embargoed data/open access" field, maybe with the date the embargo ends, so it is apparent which datasets other people will have access to. Checks on non-embargoed data could be done on submission, and embargoed data could be checked after the entered embargo end date? Otherwise it's going to be impossible to build tools that use the platform without manual metadata cleaning.
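A rough sketch of how that check scheduling could look, purely as an illustration. The `embargoed` and `embargo_end` names are hypothetical; no such fields exist in the current submission schema as far as this thread shows.

```python
from datetime import date
from typing import Optional


def validation_due(embargoed: bool, embargo_end: Optional[date], today: date) -> bool:
    """Decide whether accession checks can be run for a project right now.

    `embargoed` and `embargo_end` are hypothetical fields, not part of the
    existing schema.
    """
    if not embargoed:
        # Open-access data can be checked at submission time.
        return True
    if embargo_end is None:
        # Embargoed with no known end date: nothing can be checked yet.
        return False
    # Embargoed data becomes checkable once the embargo has ended.
    return today >= embargo_end
```

With something like this, a periodic job could re-run the accession checks for every project whose embargo has lapsed, and tools could skip projects that are still embargoed.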
Oh, I messed up those links, the second one should have been
Good point, we will consider that! And that link does indeed look suspicious, given the repeated GenBank accession IDs.
@chasemc the second link will be updated soon. In the meantime, all the accession IDs are available through the BioSample project. Thanks again for your interest in using the platform! 😎
I'm still having a rough time, even with manual intervention. Can you point me to a dataset that should be really clean and have good links? I've been working with this one since last night because it had good genomics and was fairly small, but I couldn't get the GNPS file names to link. It seems like the molecular network uses different files than the MassIVE link, and pairedomics uses the filenames in the MassIVE repo.
https://pairedomicsdata.bioinformatics.nl/projects/297c364c-b154-4edd-a7d5-68decf9effa2.4 What about this one, @chasemc? I agree that some manual intervention will remain necessary for the foreseeable future. I think the majority of these genomes can be downloaded, and the links should work. Let us know how you get on.
Thanks, that's the one I've landed on as well. The only problem so far isn't really a problem: I've written a downloader/parser for GNPS snets v2 results but not v1, so I'm currently running all that data through GNPS v2.
Is your feature request related to a problem? Please describe.
I've started trying to use the platform programmatically (using the JSON files) but have had issues with the first two datasets I've tried. Maybe it's just these two, but I was wondering whether entries are, or could be, validated upon submission? Specifically, I've encountered GenBank_accession and BioSample_accession filled with different types of data/accessions, which has made it difficult to reliably discover and download the genetic data programmatically (I'm currently using Entrez, but even so it's not finding all the data).
Some examples from the first couple of paired datasets I've looked at:
https://pairedomicsdata.bioinformatics.nl/projects/5e920f02-2ae7-4d58-a4b9-fc76740958cd.4
GenBank_accession values are clearly not GenBank accessions but user/auto-assigned "Assembly Names", e.g. "04_NF_x40_HMP1651v01", "ASM986560v1"
BioSample_accession values are "AAAAA000000", but that doesn't exist in NCBI; is this the equivalent of "None" for paired omics?
publications: "00000000"
https://pairedomicsdata.bioinformatics.nl/projects/5e920f02-2ae7-4d58-a4b9-fc76740958cd.4
GenBank_accession is filled with a single "WWJO00000000" accession
BioSample_accession is filled with "BioProject" accessions
genome_label, or using the "BioProject" accession that's in the BioSample_accession space
Describe the solution you'd like
I'm not sure GenBank accession number and RefSeq accession number will be clear to everyone if they mean "Assembly" accession; if so, it may be beneficial to add 'assembly', e.g. RefSeq assembly accession.
Ideally accessions would be checked (e.g. via Entrez); otherwise, at least valid prefixes could be checked?
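As a minimal sketch of what a prefix/format check could look like: the patterns below follow the commonly documented NCBI/INSDC accession shapes and are not exhaustive. The `classify_accession` helper and the category names are made up for illustration and are not part of the platform.

```python
import re

# Illustrative patterns only; they cover common accession shapes, not every case.
PATTERNS = {
    # Assembly accessions, e.g. GCF_000005845.2
    "assembly": re.compile(r"^GC[AF]_\d{9}\.\d+$"),
    # BioSample accessions, e.g. SAMN..., SAMEA..., SAMD...
    "biosample": re.compile(r"^SAM[NED][A-Z]?\d+$"),
    # BioProject accessions, e.g. PRJNA..., PRJEB..., PRJDB...
    "bioproject": re.compile(r"^PRJ[EDN][A-Z]\d+$"),
    # WGS master records: 4 letters + 8 or more digits, or 6 letters + 9 or more
    "wgs_master": re.compile(r"^([A-Z]{4}\d{8,}|[A-Z]{6}\d{9,})(\.\d+)?$"),
    # Classic GenBank nucleotide accessions: 1+5, 2+6 or 2+8 letters+digits
    "nucleotide": re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(\.\d+)?$"),
}


def classify_accession(value: str) -> str:
    """Return the first category whose pattern matches, else "unknown"."""
    text = value.strip()
    for kind, pattern in PATTERNS.items():
        if pattern.match(text):
            return kind
    return "unknown"


if __name__ == "__main__":
    # Values quoted in this issue, plus one well-formed assembly accession.
    for value in ["04_NF_x40_HMP1651v01", "ASM986560v1", "AAAAA000000",
                  "WWJO00000000", "GCF_000005845.2"]:
        print(value, "->", classify_accession(value))
```

A submission form could reject or flag a value when the category does not match the field, for example a BioProject accession entered in the BioSample_accession field. A format check like this only catches malformed values; confirming that an accession actually exists would still need a lookup such as Entrez.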