Include variations from processed data in nextstrain #10

Open
gtauriello opened this issue Apr 5, 2020 · 11 comments · May be fixed by #24
Labels: enhancement (New feature or request), in progress (This is currently being worked on)

Comments

@gtauriello
Contributor

Goal is to have a structure-mapped version of the variations displayed in nextstrain.
We envision the following required steps:

  • parse json data following their dev docs
  • map variations onto UniProtKB ACs used in SWISS-MODEL (the work done at the UCSC Genome Browser could be helpful for this)
  • define colors and annotation texts for variations
  • test using SWISS-MODEL's annotation system
  • properly acknowledge source of data (see also "Data" section in nextstrain's README)
  • followups: add possibility to filter results (e.g. only from country X or certain confidence), process into entropies, ...
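As a rough illustration of the first step, here is a minimal sketch of collecting amino-acid mutations from a nextstrain tree. It assumes the JSON follows the Auspice v2 layout (a top-level `tree` whose nodes carry `branch_attrs.mutations` keyed by gene, with `nuc` holding nucleotide changes); that layout should be checked against their dev docs. The helper names and the toy dataset are hypothetical.

```python
import json
from collections import Counter


def iter_nodes(root):
    """Yield every node of the tree, depth first, without recursion."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.get("children", []))


def collect_aa_mutations(dataset):
    """Count (gene, mutation) pairs over all branches of the tree.

    Assumes the Auspice v2 layout; nucleotide changes under "nuc" are skipped.
    """
    counts = Counter()
    for node in iter_nodes(dataset["tree"]):
        mutations = node.get("branch_attrs", {}).get("mutations", {})
        for gene, changes in mutations.items():
            if gene == "nuc":
                continue
            for change in changes:
                counts[(gene, change)] += 1
    return counts


# Toy dataset in the assumed layout (not real nextstrain output).
example = json.loads("""
{
  "tree": {
    "name": "root",
    "children": [
      {"name": "A",
       "branch_attrs": {"mutations": {"nuc": ["C8782T"],
                                      "ORF1a": ["L2235I", "N3833K"]}}},
      {"name": "B",
       "branch_attrs": {"mutations": {"ORF1a": ["L2235I"]}}}
    ]
  }
}
""")

aa_counts = collect_aa_mutations(example)
```

Counting per branch rather than per strain keeps the sketch short; filtering by country or confidence would need the node attributes as well.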
@gtauriello
Contributor Author

gtauriello commented Apr 5, 2020

Preliminary work by @jttkim (see here) could be a great starting point for such an effort.

@gtauriello gtauriello added the enhancement New feature or request label Apr 6, 2020
@tomasMasson

I'll start working on the scripts to fetch variation data from Nextstrain. If you want, @gtauriello, I can create a new branch so everyone can see and review the code.

@gtauriello
Contributor Author

That's great, thank you. Yes, please do this in a new branch or start a pull request early so people can comment on your code.

@gtauriello gtauriello added the in progress This is currently being worked on label Apr 6, 2020
@D-Barradas

D-Barradas commented Apr 6, 2020

Hi @tomasMasson, I'm interested in the branch you will create, since I was also working on parsing the variations from nextstrain. I got a result, but my code is very basic and could be more pythonic, so I'd really like to see your code. Also, some of the mutations I found look very strange to me, like N3833K (below). I retrieved about 50 like this, so I'm asking for a friend here: does anybody know what's with that large number?

   gene     GenBank accession   gisaid_epi_isl   mutations    author
   ORF1a    LR757998            EPI_ISL_406798   **L2235I**   Chen et al
   ORF1a    LR757998            EPI_ISL_406798   **N3833K**   Chen et al

@gtauriello
Contributor Author

@D-Barradas not sure what you mean by strange mutations. Do you mean because 3833 is a large number? ORF1a (aka 'Replicase polyprotein 1a' or P0DTC1 or R1A_SARS2) is indeed a 4405 AA long polyprotein (which is cut into smaller pieces), so that is not too surprising.

Also, please don't map mutations to ORF1a but, whenever possible, to the longer ORF1ab (aka 'Replicase polyprotein 1ab' or P0DTD1 or R1AB_SARS2) as described in the README of this repo. There is a small part (nsp11) at the end of ORF1a where this is ambiguous, though, due to a ribosomal frameshift (see here for details). There you can either map genome-level variations to both ORF1a and ORF1ab or just keep ignoring the ORF1a part, since I am not aware of any relevant role of nsp11.

@D-Barradas

@gtauriello thanks for answering my question. It was indeed about the number, since I was thinking in terms of smaller pieces (400 aa). Another question then: nextstrain reports ORF1a and ORF1b as separate entities. Should we also ignore ORF1b just to be safe?

           ORF1a                 ORF1b
  end      13468                 21555
  seqid    config/reference.gb   config/reference.gb
  start    266                   13468
  strand   +                     +
  type     CDS                   CDS

@gtauriello
Contributor Author

With ignoring I just meant the part in ORF1a which differs from ORF1ab. Just to be clear...

For the naming used here with ORF1a and ORF1b, we should keep all those variations and map both to ORF1ab (P0DTD1). I suppose one needs to be careful with mutations at genome position 13468, as they can affect 2 amino acids, but I have no idea how nextstrain handles that.

It seems that nextstrain already maps the mutations into protein-sequence space, so with an appropriate offset you should be able to easily map ORF1b to ORF1ab. But please do add some sanity checks to make sure that the sequences match (i.e. if you map "K2160E" from ORF1b onto P0DTD1, we expect a 'K' at the mapped position).
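A minimal sketch of such an offset mapping with a residue check could look like the following. The offset is derived from the ORF1a coordinates quoted above (start 266, end 13468, i.e. 4401 codons) and should be treated as an assumption until the check passes on the real P0DTD1 sequence. The function name and the toy reference sequence are hypothetical.

```python
import re

# Genome coordinates of ORF1a as reported by nextstrain (quoted above).
ORF1A_START, ORF1A_END = 266, 13468

# Assumed offset: number of codons in nextstrain's ORF1a (13203 nt / 3 = 4401).
ORF1B_OFFSET = (ORF1A_END - ORF1A_START + 1) // 3

MUTATION_RE = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def map_to_orf1ab(gene, mutation, reference):
    """Map an ORF1a/ORF1b mutation string onto ORF1ab numbering.

    Returns (mapped_mutation, residue_matches), where residue_matches says
    whether `reference` carries the expected wild-type residue at the mapped
    position. `reference` is the ORF1ab protein sequence as a plain string.
    """
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation format: {mutation!r}")
    wild_type, variant = match.group(1), match.group(3)
    position = int(match.group(2))

    if gene == "ORF1b":
        position += ORF1B_OFFSET
    elif gene != "ORF1a":
        raise ValueError(f"unsupported gene: {gene!r}")

    if not 1 <= position <= len(reference):
        raise ValueError(f"position {position} is outside the reference")

    residue_matches = reference[position - 1] == wild_type
    return f"{wild_type}{position}{variant}", residue_matches


# Toy reference (not the real P0DTD1 sequence): a 'K' right after the offset.
toy_reference = "A" * ORF1B_OFFSET + "K" + "A" * 10
mapped, ok = map_to_orf1ab("ORF1b", "K1E", toy_reference)
```

The check returns a flag instead of raising, so mismatches can be collected and inspected in bulk.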

@gtauriello gtauriello linked a pull request Apr 8, 2020 that will close this issue
@gtauriello gtauriello linked a pull request Apr 9, 2020 that will close this issue
@gtauriello
Contributor Author

A possible followup for this could use data from the China National Center for Bioinformation as done in this related resource from UC Riverside: https://coronavirus3d.org/index.html

@gtauriello
Contributor Author

gtauriello commented Apr 25, 2020

Two more comments on the above:

  1. Unsure whether that source for mutations is illegally bypassing GISAID data sharing policies (based on discussions in the public_sequence_resource topic of the biohackathon). So we should probably use it with care. The main sources of data there are GISAID and GenBank.
  2. Nextstrain is subsampling their phylogenetic tree (see this discussion here). So we may need another approach to get the full set of variations.

@tomasMasson

I'll take a look at both points.

@tomasMasson

It looks like the Nextstrain guys are releasing the full dataset (12397 genomes) on their viz page (nextstrain/ncov#364 (comment)), with the raw data living at http://data.nextstrain.org/ncov_global.json. However, I could count only 3123 GISAID genomes (by passing the JSON data through a grep filter on the command line).
Additionally, at http://cov-glue.cvr.gla.ac.uk/#/home they released a table with amino acid replacements for the GISAID sequences. The problem with this site is the lack of a download button for the data (it is an alpha version, maybe they are going to add it later).
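For reference, a grep-style count of GISAID genomes can be sketched in Python as below. It only assumes that accessions appear in the raw JSON text as `EPI_ISL_` followed by digits (as in the EPI_ISL_406798 example earlier in this thread) and counts each distinct accession once. The function name and sample text are hypothetical.

```python
import re

GISAID_ID_RE = re.compile(r"EPI_ISL_\d+")


def count_gisaid_ids(text):
    """Return the number of distinct GISAID accessions found in raw text."""
    return len(set(GISAID_ID_RE.findall(text)))


# Toy snippet (not real nextstrain output); one accession appears twice.
sample = (
    '{"gisaid_epi_isl": {"value": "EPI_ISL_406798"}} '
    '{"gisaid_epi_isl": {"value": "EPI_ISL_402125"}} '
    '{"gisaid_epi_isl": {"value": "EPI_ISL_406798"}}'
)
n_genomes = count_gisaid_ids(sample)
```

Counting distinct accessions avoids double counting when an ID is repeated, which a plain line-based grep count would not catch.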
