Extract annotations from IntAct / ComplexPortal #25

Open
gtauriello opened this issue Apr 7, 2020 · 16 comments
Labels
enhancement New feature or request

Comments

@gtauriello
Contributor

gtauriello commented Apr 7, 2020

The two EBI resources IntAct and ComplexPortal contain curated data on experimentally observed interactions between proteins.

From the EBI webpage you find links to query the IntAct webpage or download the IntAct data in PSI-MI TAB format here: [ftp://ftp.ebi.ac.uk/pub/databases/intact/current/psi25/datasets/Coronavirus.zip].

Notes:

  • Of special interest are the entries that have positional information: e.g. mutations made to use a protein as bait to find a set of interactions (example) or a restricted range of the protein responsible for an interaction (example). For the linked examples, look for the little red "F" in the last column of the participant table to see the positional data.
  • The data is already mapped onto UniProt sequences, but care must be taken with the polyproteins (P0DTD1/P0DTC1, see here for info). The "PRO_..." notation in IntAct should be mappable to PTM-annotations in UniProt for the polyproteins (P0DTD1 and P0DTC1).
  • Data is available from MINT and IntAct and both can be used without issues.
  • The query above might include data from several CoVs, not just SARS-CoV-2, so you may have to filter it by taxid.
  • ComplexPortal data can be found by looking for SARS-CoV-2 here
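
The taxid filtering mentioned in the notes could be sketched roughly as follows. This is a minimal sketch, not the project's script: it assumes the download is tab-separated in the PSI-MI TAB 2.5 column layout (organism of interactor A in column 10, of interactor B in column 11, written as `taxid:NNNN(name)`), and the sample rows are made up for illustration. The SARS-CoV-2 taxid 2697049 is the one used in the SWISS-MODEL species URL further down in this thread.

```python
import re

SARS_COV_2_TAXID = "2697049"


def taxids(field):
    """Return all taxonomy IDs found in one MITAB organism field."""
    return set(re.findall(r"taxid:(-?\d+)", field))


def filter_by_taxid(lines, taxid=SARS_COV_2_TAXID):
    """Yield the split columns of rows where interactor A or B has `taxid`."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11:
            continue  # assumption: anything shorter is not a MITAB row
        if taxid in taxids(cols[9]) or taxid in taxids(cols[10]):
            yield cols


def make_row(id_a, id_b, tax_a, tax_b):
    """Build a made-up 15-column row; unused columns are '-'."""
    cols = ["-"] * 15
    cols[0], cols[1], cols[9], cols[10] = id_a, id_b, tax_a, tax_b
    return "\t".join(cols)


# Illustrative rows only, not real IntAct records.
sample = [
    "#ID(s) interactor A\tID(s) interactor B",
    make_row("uniprotkb:P0DTC2", "uniprotkb:Q9BYF1",
             "taxid:2697049(SARS-CoV-2)", "taxid:9606(human)"),
    make_row("uniprotkb:EXAMPLE1", "uniprotkb:EXAMPLE2",
             "taxid:694009(other CoV)", "taxid:9606(human)"),
]

kept = list(filter_by_taxid(sample))
print(len(kept), kept[0][0])
```

Passing an open file handle instead of `sample` should work the same way, since the function only iterates over lines.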

Also: Birgit Meldal from the IntAct / ComplexPortal team is available in the Slack channel for questions and I will update this comment if we get new input and links that can be of general use.

@gtauriello gtauriello added the enhancement New feature or request label Apr 7, 2020
@bmeldal

bmeldal commented Apr 7, 2020

I'm here!

Just a small, political, correction: there are 10 members of the IMEx consortium that curate into IntAct. MINT and IntAct itself are just 2 of them. E.g., DIP have also contributed many SARS publications in this last month.

And yes, we have also decided to annotate to the longer polyprotein in SARS-CoV and SARS-CoV-2 (e.g. R1AB, P0DTD1) except for the small protein nsp11 that is only translated from the short polyprotein of SARS-CoV-2 (R1a, P0DTC1). The long polyprotein codes for nsp12 at the ribosomal slippage site.

For Complex Portal you can find the data via our organism page:
https://www.ebi.ac.uk/complexportal/complex/organisms
It also has a WS JSON endpoint but only via the individual AC queries.
Or download the whole species file in xml via:
ftp://ftp.ebi.ac.uk/pub/databases/intact/complex/current/psi30/

Any questions, please ask! Slack ID is the same as GitHub.

@gtauriello
Contributor Author

@all-contributors please add @bmeldal for ideas, content

@allcontributors
Contributor

@gtauriello

I've put up a pull request to add @bmeldal! 🎉

@bmeldal

bmeldal commented Apr 7, 2020

Thank you!

@D-Barradas

@gtauriello so the annotations are only for the virus proteins, right?

@bmeldal

bmeldal commented Apr 8, 2020

IntAct & ComplexPortal have both, virus and human proteins.
Not sure if that was your question, though ;-)

@gtauriello
Contributor Author

gtauriello commented Apr 8, 2020

@D-Barradas also unsure about the question.

Personally, I would start by looking at all interactions returned in the query above (or the download) and extract any positional data you can find. The query should restrict it to coronavirus-relevant interactions. The annotation system works for any UniProtKB AC and not just the virus proteins. So you can safely have annotations mapped e.g. on structures for the human proteins involved in those interactions...

@D-Barradas

Hi @gtauriello @bmeldal:
Sorry for the cryptic question. Basically, you have answered it. I have already uploaded my annotations, and in the process I found that the server does not like the PRO_ identifiers 👍
Couldn't find P0DTD1-PRO_0000449623 by UniProt AC or MD5. <- this was the warning
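
One way to avoid that warning is to separate the UniProt accession from the chain suffix before uploading. The snippet below is only a sketch; the function name and the regular expression are assumptions for illustration, based on the single identifier shape shown in the warning.

```python
import re

# Assumed shape: <UniProt AC>-PRO_<digits>, as in the warning above.
_PRO_ID = re.compile(r"^(?P<acc>[A-Z0-9]+)-(?P<chain>PRO_\d+)$")


def split_chain_id(identifier):
    """Split 'ACC-PRO_nnn' into (ACC, 'PRO_nnn'); return (identifier, None) otherwise."""
    match = _PRO_ID.match(identifier)
    if match is None:
        return identifier, None
    return match.group("acc"), match.group("chain")


print(split_chain_id("P0DTD1-PRO_0000449623"))
print(split_chain_id("P0DTC2"))
```

Identifiers without a chain suffix pass through unchanged, so the function can be applied to every row.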

@gtauriello
Contributor Author

Yes, for the polyproteins you will need to do some extra mapping. Assuming you have a position within P0DTD1-PRO_... (or P0DTC1-PRO_...), you need to proceed as follows:

  1. Extract the start/end of those PRO_... from UniProt: P0DTD1 and P0DTC1. You can do this either manually or parse the UniProt-files looking for the "FT CHAIN" entries (P0DTD1 and P0DTC1)...
  2. You should be able to use the start in UniProt to offset your data.

As an example: say you have position 10 in P0DTD1-PRO_0000449623. From UniProt you see that PRO_0000449623 covers positions 3264-3569. That means that pos. 10 in P0DTD1-PRO_0000449623 corresponds to pos. 3273 in P0DTD1.
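
The two steps and the worked example can be sketched in Python as below. This is a rough sketch under assumptions: the parser expects the current UniProt flat-file layout in which a `FT   CHAIN   start..end` line is followed by a `/id="PRO_..."` qualifier line, it skips chains with uncertain boundaries, and the sample text is a hand-written fragment containing only the range quoted above. The function names are made up for illustration.

```python
import re

_CHAIN = re.compile(r"^FT\s+CHAIN\s+(\d+)\.\.(\d+)")
_NEW_FEATURE = re.compile(r"^FT {3}\S")  # any feature key line
_ID = re.compile(r'^FT\s+/id="(PRO_\d+)"')


def parse_chains(flat_text):
    """Map each PRO_ id to its (start, end) in the full UniProt sequence."""
    chains = {}
    current = None
    for line in flat_text.splitlines():
        chain = _CHAIN.match(line)
        if chain:
            current = (int(chain.group(1)), int(chain.group(2)))
        elif _NEW_FEATURE.match(line):
            current = None  # a different feature started
        elif current is not None:
            id_match = _ID.match(line)
            if id_match:
                chains[id_match.group(1)] = current
    return chains


def chain_to_precursor(pos, chain_start, chain_end):
    """Convert a 1-indexed position within a chain to the full sequence."""
    mapped = chain_start + pos - 1
    if pos < 1 or mapped > chain_end:
        raise ValueError(f"position {pos} is outside the chain")
    return mapped


# Hand-written fragment for illustration, not a full UniProt entry.
SAMPLE = '''\
FT   CHAIN           3264..3569
FT                   /note="(illustrative)"
FT                   /id="PRO_0000449623"
'''

chains = parse_chains(SAMPLE)
start, end = chains["PRO_0000449623"]
print(chain_to_precursor(10, start, end))  # 3264 + 10 - 1 = 3273
```

The `- 1` is there because both coordinate systems are 1-indexed: position 1 of the chain is the chain's start position itself.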

Also, any position that you find in P0DTC1 should be mapped to P0DTD1 as long as it's not in "Non-structural protein 11" (which covers positions >= 4393 of P0DTC1). Technically you could also duplicate all those annotations but it's easier to have them just once...
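
That rule could look like this in code. It is a sketch that assumes a P0DTC1 position before nsp11 carries over to P0DTD1 with the same number, which the comment implies but does not state outright; the function name is hypothetical.

```python
NSP11_START_IN_P0DTC1 = 4393  # threshold quoted in the comment above


def normalise_polyprotein(acc, pos):
    """Move P0DTC1 positions outside nsp11 onto P0DTD1; leave the rest alone.

    Assumption: the position number is unchanged by the move.
    """
    if acc == "P0DTC1" and pos < NSP11_START_IN_P0DTC1:
        return "P0DTD1", pos
    return acc, pos


print(normalise_polyprotein("P0DTC1", 100))
print(normalise_polyprotein("P0DTC1", 4393))
```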

@bmeldal I am assuming above that your positions are 1-indexed: i.e. that the first AA of a protein is at position "1" and not "0". Is that correct?

@bmeldal

bmeldal commented Apr 9, 2020

Morning,

Yes, that is all correct!
It's a shame that UniProt doesn't allow the PRO-chain search by default but @gtauriello's workaround is correct.
And yes, chain positions are 1-indexed.
We should only have used P0DTD1 except for nsp11.

@gtauriello
Contributor Author

A nice example is here (thx @D-Barradas for pointing me to it). I quickly turned it manually into an annotation (see project link here):

P0DTC2,481,487,#FF0000,https://www.ebi.ac.uk/intact/interaction/EBI-25496287,mutation disrupting strength (p.Asn481_Asn487delinsThrProProAlaLeuAsn)
P0DTC2,493,493,#00FF00,https://www.ebi.ac.uk/intact/interaction/EBI-25496287,mutation decreasing strength (p.Gln493Asn)
P0DTC2,493,493,#00FF00,https://www.ebi.ac.uk/intact/interaction/EBI-25496287,mutation decreasing strength (p.Gln493Tyr)
P0DTC2,501,501,#FF0000,https://www.ebi.ac.uk/intact/interaction/EBI-25496287,mutation disrupting strength (p.Asn501Thr)
Q9BYF1,18,633,#0000FF,https://www.ebi.ac.uk/intact/interaction/EBI-25496287,sufficient to bind (ecd)

I will make sure that on our side we can nicely display annotations on both subunits of heteromers (currently you can see either ACE2 or spike annotations but not both at the same time).

Having a script that scans IntAct to extract a csv like above automatically (with some clever coloring logic) would be a really useful addition.
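
As a starting point for such a script, the CSV-writing half might be sketched like this. The colour rules are only inferred from the five hand-made lines above (disrupting red, decreasing green, "sufficient to bind" blue), the grey fallback is a hypothetical choice, and the extraction from IntAct itself is not covered here.

```python
import csv
import io

# Colour rules inferred from the manual example; not an official scheme.
COLOURS = [
    ("mutation disrupting", "#FF0000"),
    ("mutation decreasing", "#00FF00"),
    ("sufficient to bind", "#0000FF"),
]
DEFAULT_COLOUR = "#808080"  # hypothetical fallback for other feature types


def pick_colour(description):
    """Return the colour of the first rule whose prefix matches."""
    for prefix, colour in COLOURS:
        if description.startswith(prefix):
            return colour
    return DEFAULT_COLOUR


def to_annotation_csv(features, url):
    """Format (acc, start, end, description) tuples as annotation lines."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for acc, start, end, description in features:
        writer.writerow([acc, start, end, pick_colour(description), url, description])
    return buf.getvalue()


url = "https://www.ebi.ac.uk/intact/interaction/EBI-25496287"
features = [
    ("P0DTC2", 481, 487,
     "mutation disrupting strength (p.Asn481_Asn487delinsThrProProAlaLeuAsn)"),
    ("P0DTC2", 493, 493, "mutation decreasing strength (p.Gln493Asn)"),
    ("Q9BYF1", 18, 633, "sufficient to bind (ecd)"),
]
print(to_annotation_csv(features, url), end="")
```

Note that `csv.writer` quotes any description containing a comma; whether the annotation upload accepts quoted fields is not stated in this thread and would need checking.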

@gtauriello
Contributor Author

As a starting point, here are some files (thx @D-Barradas): Archive.zip

It contains:

  • a Python script that can act as a starting point to parse the data. It needs pandas and this repo in PYTHONPATH to work. See the first few lines in the code for files to be downloaded and adapt paths accordingly. This is unfinished work but the script produces tsv formatted text which is almost ready for upload.
  • an image based on this data which showcases this type of annotation.

Still TODO:

  • Fix "-PRO.." identifiers and translate to position in UniProt sequence
  • Cleanup script
  • Extract better text and URLs

@gtauriello
Contributor Author

So we ended up doing another script to extract PPI between SARS-CoV-2 and human proteins from IntAct. The script is loosely based on the one above and attached here: PPI-IntAct.zip

The result of it is a dedicated page on our server listing the structural coverage for all those interaction partners: https://swissmodel.expasy.org/repository/species/2697049/interactions

@bmeldal

bmeldal commented May 15, 2020

There's a typo on https://swissmodel.expasy.org/repository/species/2697049

"IntAct lists interactions derived from literature curation or direct user submissions. We extracted those interactions and list the ones between SARS-CoV-2 and human host proteins with their structural coverage in a decicated interaction page." should read dedicated

Freudian slip??? I know the data is not yet saturated... ;-)

Great work!

Please remember to cite IntAct in any resulting manuscripts.

@bmeldal

bmeldal commented May 15, 2020

Feature suggestion:

On the interactions page: https://swissmodel.expasy.org/repository/species/2697049/interactions

Allow the user to collapse the list for a given protein again without having to open another one. When the list is long (e.g. spike) it becomes difficult to navigate the page.

@gtauriello
Contributor Author

Oops good point with the typo. I must have been thirsty when I wrote that... ;-)
The list gets collapsed as soon as you choose another one but we can add the feature. Doesn't hurt...
