Develop a potential recipe on conversion of secondary IDs to primary IDs #340

Open
1 of 10 tasks
lucas-ubm opened this issue Jun 4, 2021 · 9 comments
Labels: "author's task: write recipe" (author has to write the full recipe); "issue type: proposal - new recipe" (issue suggests to create one new recipe)

@lucas-ubm
Collaborator

lucas-ubm commented Jun 4, 2021

For the BridgeDbR issue that sparked the idea for this recipe see here by @DeniseSl22. In this recipe we would provide a workflow to map secondary IDs to primary IDs. The recipe would therefore be mostly hands-on, but it could also include a theoretical part to highlight the importance of the task in the context of improving data interoperability (as long as the content doesn't overlap with the Identifier mapping recipe). I could be in charge of developing the required scripts/Jupyter notebooks and might also be able to work on the theoretical side if we decide to include it.

  • identify author
  • agree with editors on scope
  • write abstract
  • write feedback on abstract
  • make corresponding changes to abstract
  • write recipe
  • identify reviewer
  • conduct review
  • incorporate reviewer's comments
  • publish recipe
@lucas-ubm lucas-ubm added the labels "issue type: proposal" (issue contains general proposals on how to proceed forwards) and "author's task: write recipe" (author has to write the full recipe) on Jun 4, 2021
@lucas-ubm lucas-ubm assigned egonw and lucas-ubm and unassigned egonw and lucas-ubm Jun 4, 2021
@egonw
Collaborator

egonw commented Jun 6, 2021

The problem here is this. Many databases delete or deprecate identifiers, including ELIXIR Core Resources. It is currently not easy to figure out whether a data set uses outdated identifiers. If it does, then a string match of its identifiers against the up-to-date database does not work. The BridgeDb Project started collecting this kind of information (see BridgeDb Tiwid), but there is no guidance on how to use it yet.

@egonw
Collaborator

egonw commented Jun 6, 2021

Author would be Lucas, with support from @DeniseSl22 and me.

@DeniseSl22
Contributor

But there are also some databases that do keep track of their secondary IDs (HMDB and ChEBI, for example). So I would start with those ("the easy example"), and then we can think about how to apply a similar workflow to the removed IDs which we just don't have in a mapping file. There's also the BED tool, which found a way to keep track of these (using a graph database) for gene/protein identifiers.

@ghost

ghost commented Jun 9, 2021

As mentioned in today's call, but written here as well: for me it is not even clear what secondary and primary identifiers are. But I am looking forward to the abstract to see whether I understand it then. Best!

@DeniseSl22
Contributor

@robertgiessmann: thanks for asking! The primary IDs are the IDs the database wants you to use when referring to a specific molecular entity. The secondary IDs are IDs in the same database which refer to a similar entity as was meant by the primary one (so duplicates). At some point these IDs are cleaned up: they are considered "old/outdated" and linked to the primary one (in various ways), or they are deleted. When there is a link between primary and secondary, we can actually understand which entity is meant in a dataset that is annotated with old IDs. I hope this explains it a bit more; if not, let me know!

@ghost

ghost commented Jun 9, 2021

Hi @DeniseSl22, sorry to say, actually I am more confused now. Are you talking about cross-refs? Shall we consider a specific example?

I noticed https://github.com/bridgedb/tiwid/ -- taking from there:

https://github.com/bridgedb/tiwid/blob/main/data/chebi.csv

there was once a ChEBI identifier "594834" (I guess...) -- which does not resolve right now: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:594834 => returns error

Doing a Google search for it, and looking at https://bioinformatics.charite.de/supertarget/index.php?site=drug_target&id=9977819-CSK_HUMAN , it seems CHEBI:594834 once referred to https://pubchem.ncbi.nlm.nih.gov/compound/9977819 , which has status Non-live in PubChem, and was thus probably removed from ChEBI.

We can only speculate about the reasons for this Non-live status, as there is no information, but that doesn't matter after all.

Is there a primary and secondary identifier in here, already?

@DeniseSl22
Contributor

Hi @robertgiessmann, more confusion was not my intention ;) The example you provide is indeed an example of an ID which has been removed (and we don't really know where it went; finding that out is probably only possible by asking the ChEBI team and having them check their archived data or so...). Here are two examples, which hopefully put it in more perspective:

image

Water has as primary ID: CHEBI:15377; there's also an item on the ChEBI website called "secondary ChEBI IDs", with items: "CHEBI:5585, CHEBI:42857, CHEBI:42043, CHEBI:44292, CHEBI:44819, CHEBI:43228, CHEBI:44701, CHEBI:10743, CHEBI:13352, CHEBI:27313".
So, within the ChEBI database at some point there were 11 entries for water (1x the primary ID, and 10x the secondary IDs). These have been merged into one database entry, where the ID 15377 has been selected as the one to use from now on (the primary one).
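
As a rough illustration of how such a merge could be used when cleaning a dataset, here is a minimal Python sketch. It only hard-codes the water IDs quoted above; the names `SECONDARY_BY_PRIMARY`, `PRIMARY_BY_SECONDARY` and `to_primary` are made up for this example and are not part of ChEBI or BridgeDb.

```python
# Primary ChEBI ID for water and the secondary IDs quoted in this comment.
SECONDARY_BY_PRIMARY = {
    "CHEBI:15377": [
        "CHEBI:5585", "CHEBI:42857", "CHEBI:42043", "CHEBI:44292",
        "CHEBI:44819", "CHEBI:43228", "CHEBI:44701", "CHEBI:10743",
        "CHEBI:13352", "CHEBI:27313",
    ],
}

# Invert the table so each secondary ID points at its primary ID.
PRIMARY_BY_SECONDARY = {
    secondary: primary
    for primary, secondaries in SECONDARY_BY_PRIMARY.items()
    for secondary in secondaries
}


def to_primary(chebi_id):
    """Return the primary ID for a known secondary ID, else the input unchanged."""
    return PRIMARY_BY_SECONDARY.get(chebi_id, chebi_id)
```

With this table, `to_primary("CHEBI:5585")` gives `CHEBI:15377`, and an ID that is already primary (or simply unknown to the table) is passed through untouched.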

image

Urobilin in HMDB has the main/primary ID HMDB0004160 (that's also the ID used in the URL for this item, and the one other databases should use as cross-ref). This compound also has some other IDs connected to it (under "Secondary Accession numbers"): HMDB0004159, HMDB0004161, HMDB04159, HMDB04160, HMDB04161. So again, at some point the HMDB team realised they had the same compound in their database as separate entries, after which they merged them into one entry and selected one ID as the new main one (the primary). With HMDB, there's also the change in ID structure: HMDB 3.x used IDs with the structure HMDBabcde, while HMDB 4.x uses HMDB00abcde (with abcde as digits). So the entry for Urobilin was present in HMDB 3.x three times (HMDB04159, HMDB04160, HMDB04161). Then the ID structure itself got changed (to HMDB0004159, HMDB0004160, HMDB0004161). After that, the three entries were recognised as the same compound and merged, and one ID was selected to be the main one (HMDB0004160).
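
The two steps for HMDB (first the change in ID structure, then the merge) could be sketched in Python roughly as follows. This is only an illustration built on the Urobilin IDs above; `OLD_HMDB`, `HMDB_PRIMARY`, `normalise_hmdb` and `hmdb_to_primary` are hypothetical names, and the padding rule assumes that old-style IDs always have exactly five digits.

```python
import re

# Old-style HMDB IDs: "HMDB" followed by exactly five digits (e.g. HMDB04159).
OLD_HMDB = re.compile(r"^HMDB(\d{5})$")

# Secondary -> primary for Urobilin, written with new-style (seven-digit) IDs.
HMDB_PRIMARY = {
    "HMDB0004159": "HMDB0004160",
    "HMDB0004161": "HMDB0004160",
}


def normalise_hmdb(hmdb_id):
    """Pad a five-digit HMDB ID to the seven-digit form; leave others unchanged."""
    match = OLD_HMDB.match(hmdb_id)
    return f"HMDB00{match.group(1)}" if match else hmdb_id


def hmdb_to_primary(hmdb_id):
    """Normalise the ID structure first, then follow the merge to the primary ID."""
    new_style = normalise_hmdb(hmdb_id)
    return HMDB_PRIMARY.get(new_style, new_style)
```

All five secondary accession numbers listed above end up at HMDB0004160 this way, for example `hmdb_to_primary("HMDB04159")`.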

Lots of databases have these "issues", since duplicate entries need to be dealt with. I think the way ChEBI and HMDB do this makes the changes traceable (in the two examples above, not in the example you provided). Removing duplicate entries without being able to link that information together creates problems for data analysis (as is the case with your example: finding out which compound is meant by "CHEBI:594834" will take quite some time).
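
To make the difference between traceable and untraceable IDs concrete, a dataset could be screened along these lines. This is a sketch under assumptions: the three lookup tables are tiny hand-made samples from this thread, `classify` is a hypothetical helper, and in practice the set of withdrawn IDs would have to come from a source such as BridgeDb Tiwid, whose file layout is not assumed here.

```python
# Tiny hand-made samples taken from the examples in this thread.
PRIMARY = {"CHEBI:15377"}
SECONDARY_TO_PRIMARY = {
    "CHEBI:5585": "CHEBI:15377",
    "CHEBI:42857": "CHEBI:15377",
}
WITHDRAWN = {"CHEBI:594834"}  # removed without a link to a replacement


def classify(identifier):
    """Return (status, usable_id); usable_id is None when nothing can be traced."""
    if identifier in PRIMARY:
        return ("primary", identifier)
    if identifier in SECONDARY_TO_PRIMARY:
        return ("secondary", SECONDARY_TO_PRIMARY[identifier])
    if identifier in WITHDRAWN:
        return ("withdrawn", None)
    return ("unknown", None)


dataset = ["CHEBI:15377", "CHEBI:5585", "CHEBI:594834", "CHEBI:0000"]
report = {identifier: classify(identifier) for identifier in dataset}
```

Only the "secondary" rows can be repaired automatically; "withdrawn" and "unknown" rows still need manual follow-up.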

@ghost

ghost commented Jun 9, 2021

Ah, I see now -- also where the wording "secondary identifier" derives from...

Cool! Well, yeah, that's a common problem.

I would split it into multiple issues, I guess:

Do you see any more intrinsic aspects of this, in this context?

Thanks again for the explanation, really helped me a lot -- I guess this can be recycled straight into the recipe! 👍

@ghost ghost added the label "issue type: proposal - new recipe" (issue suggests to create one new recipe) and removed the label "issue type: proposal" (issue contains general proposals on how to proceed forwards) on Aug 3, 2021
@egonw egonw assigned tabbassidaloii and unassigned lucas-ubm Feb 21, 2022
@tabbassidaloii
Collaborator

@proccaserra, it will take some time to wrap up this recipe, but I am working on it.
