Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request for update to Taxon Table #93

Open
AlexaBennett opened this issue Jan 16, 2021 · 7 comments
Open

Request for update to Taxon Table #93

AlexaBennett opened this issue Jan 16, 2021 · 7 comments

Comments

@AlexaBennett
Copy link

Request to update the get-ncbi-data to format the taxon table as it would appear in SILVA. Currently, the scripts populates all ranks regardless of presence in TaxID lineage.

Example of actual:

Feature ID Taxon
KR045484.1 k__Bacteria; p__Bacteria; c__Bacteria; o__Bacteria; f__Bacteria; g__uncultured; s__bacterium

Example of desired:

Feature ID Taxon
KR045484.1 k__Bacteria; p__; c__; o__; f__; g__; s__

Additionally, would it be possible to allow for specification of accession number target range? E.x. 'KR045484.1:3-200'. Or, to specify only to retrieve the taxonomy? This utility would be ideal for those of us with a curated list of sequences from a HMM. It is common to retrieve sequences from genomic assemblies. It is also possible to have multiple copies with genetic variation between the genomic regions.

@nbokulich
Copy link
Collaborator

Hi @AlexaBennett ,

Currently, the scripts populates all ranks regardless of presence in TaxID lineage

By default, it does. But there is a rank propagation option: to disable, use --p-no-rank-propagation with the CLI or rank_propagation=False with the API. For example:

>>> import qiime2 as q2, pandas as pd
>>> from qiime2.plugins import rescript as rs
>>> ids = q2.Metadata(pd.DataFrame(['KR045484.1'], index=pd.Index(['KR045484.1'], name='id'), columns=['col']))
>>> s1, t1, = rs.actions.get_ncbi_data(accession_ids=ids)
>>> t1.view(pd.Series)[0]
'k__Bacteria; p__Bacteria; c__Bacteria; o__Bacteria; f__Bacteria; g__uncultured; s__bacterium'
>>> s2, t2, = rs.actions.get_ncbi_data(accession_ids=ids, rank_propagation=False)
>>> t2.view(pd.Series)[0]
'k__Environmental samples; p__; c__; o__; f__; g__uncultured; s__bacterium'
>>> s3, t3, = rs.actions.get_ncbi_data(accession_ids=ids, rank_propagation=False, ranks=['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'])
>>> t3.view(pd.Series)[0]
'sk__Bacteria; p__; c__; o__; f__; g__uncultured; s__bacterium'

The taxonomy does not perfectly match your desired taxonomy because disabling rank propagation yields the raw NCBI taxonomy (which in this case has no kingdom annotation as in example 2, and contains annotations g__uncultured; s__bacterium), but as shown in the third example you can pick and choose your ranks.

Does that meet your need?

Additionally, would it be possible to allow for specification of accession number target range? E.x. 'KR045484.1:3-200'. Or, to specify only to retrieve the taxonomy?

I really like this idea, and it was on our radar at some point. @BenKaehler can you enlighten us as to whether there is a workaround for this and/or what it would take to implement?

specify only to retrieve the taxonomy?

Also a great idea. I don't think there would be an issue with adding an option to ignore the sequences, and e.g., get-silva-data has an equivalent option for skipping the sequences.

We are very open to contributions if you are interested in tackling either of these features @AlexaBennett 😉

@AlexaBennett
Copy link
Author

The taxonomy does not perfectly match your desired taxonomy because disabling rank propagation yields the raw NCBI taxonomy (which in this case has no kingdom annotation as in example 2, and contains annotations g__uncultured; s__bacterium), but as shown in the third example you can pick and choose your ranks.

Does that meet your need?

I can make it work. I might have to use superkingdom as in your last example. If I am not mistaken 'superkingdom' is the same as 'domain'? Therefore, it is reasonable to use the later in a prokaryotic study without loss of information. 🤔

We are very open to contributions if you are interested in tackling either of these features @AlexaBennett 😉

Sure! I have an idea of tackle them, but I am a only a 3rd year graduate student coming from a Microbiology background. Would there be anyone I may contact if I have questions throughout the process?

@nbokulich
Copy link
Collaborator

If I am not mistaken 'superkingdom' is the same as 'domain'? Therefore, it is reasonable to use the later in a prokaryotic study without loss of information

correct and that sounds reasonable, but I cannot guarantee that this is uniform across the NCBI taxonomy... e.g., that annotations for superkingdom are consistently used across entries, so it would be worth checking after you get the data whether any superkingdom annotations are missing if that's a problem for your application.

Sure! I have an idea of tackle them, but I am a only a 3rd year graduate student coming from a Microbiology background. Would there be anyone I may contact if I have questions throughout the process?

Great! Yes, feel free to ask any questions here that are specific to RESCRIPt/this issue, and for more general development questions you can ask on teh QIIME 2 forum.

To get you started:

  1. This will be a useful resource for starting with QIIME 2 development: https://dev.qiime2.org/latest/
  2. optional trimming based on position will only be available when passing in a list of IDs for explicit download, so the length ranges could be added as a column in that metadata file. To make this more generalizable, it might be worth making this a separate method for trimming sequences based on seq ID... then it would be possible to, e.g., get seqs using a query, then trim based on specific IDs.
  3. For skipping download of sequences, take a look at get-silva-data to get started... the mechanism for skipping would be quite different within the method, but "on the surface" we should make the API align as much as possible.

Keep us updated and let us know if you run into problems. Thanks!

@BenKaehler
Copy link
Collaborator

Hi @AlexaBennett, very sorry for the slow response.

Unless I've missed something, it is fairly straightforward to download a range of accessions. I didn't understand your example, but this one appears to work as expected:

qiime rescript get-ncbi-data \
  --p-query 'KR045200:KR045483[accn]' \
  --o-sequences range-seqs.qza \
  --o-taxonomy range-tax.qza \
  --p-no-rank-propagation \
  --verbose

The issue of avoiding downloading the sequences when downloading the taxonomies is a bit trickier.

At the moment the script downloads the sequences and the taxids for the given accessions in the same download.

It is probably possible to download the metadata around the accession ids without downloading the sequences, but I would have to play around with the download formats.

@nbokulich, I'm not sure that this is a good beginner issue. The unit tests on their own would very challenging for a beginner coder.

@AlexaBennett, how badly do you need to be able to download taxa without sequences? As far as I can see the only challenge might be performance, given that you could delete the sequences after you download them. Not saying that it wouldn't be a nice feature to have, just that there is an obvious workaround until we get the opportunity to implement it.

@nbokulich
Copy link
Collaborator

it is fairly straightforward to download a range of accessions

Great news, thanks for the example @BenKaehler !

But I think @AlexaBennett 's request is to trim to a specific position from within a genome. Is this already possible @BenKaehler ?

@BenKaehler
Copy link
Collaborator

BenKaehler commented Feb 8, 2021

Oh, gotcha, that is covered by the "Unless I've missed something".

Adding that functionality would be a reasonable beginner issue.

@AlexaBennett
Copy link
Author

But I think @AlexaBennett 's request is to trim to a specific position from within a genome. Is this already possible @BenKaehler ?

Yes, I am interested in the ability to trim to 1+ positions within the genome.

how badly do you need to be able to download taxa without sequences? As far as I can see the only challenge might be performance, given that you could delete the sequences after you download them. Not saying that it wouldn't be a nice feature to have, just that there is an obvious workaround until we get the opportunity to implement it.

This is the current workaround I am utilizing in the QIIME environment. However, my primary dilemma is a product of having to strip the positions from accession numbers. The resulting taxa file contains only accessions ids and taxa values. It has thus dereplicated accession numbers from organisms with multiple copy numbers. My current method to handle this is the following:

  1. Extract the taxa file form QIIME
  2. Use python to create a dictionary {accessionId: taxa}
  3. Loop over original FASTA file utilizing regex to match the accession ids
  4. Print the comment line from the FASTA with the corresponding taxa value
  5. Import richer taxa file into QIIME

It is probably possible to download the metadata around the accession ids without downloading the sequences, but I would have to play around with the download formats.

The script I am currently developing extracts the TaxId from the eSummary (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi). That bypasses having to download the sequences. @BenKaehler, could that work in the current script?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants