How to get a taxonomy table from taxizedb? #64

Open
GossypiumH opened this issue Feb 27, 2023 · 2 comments

@GossypiumH

Hello,

In January I ran into a problem with the taxize API because of the number of bacterial taxa (10k+) for which I want to retrieve the taxonomy (I posted about the problem here: ropensci/taxize#907).

People advised me to use taxizedb, which works offline and should fix my problem. However, when I try to run a simple command such as:

test = classification(name2taxid(c(taxa$specie_ID)))

taxa is a data frame with a single column named specie_ID, as follows:

> head(taxa$specie_ID)
[1] "Staphylococcus sp." "Acinetobacter sp." "Cutibacterium sp." "Sphingomonas sp." "Paenarthrobacter sp."
[6] "Paracoccus sp."

However, I receive an error:

> test = classification(name2taxid(c(taxa$specie_ID)))
Error in name2taxid(c(taxa$specie_ID)) :
  Some of the input names are ambiguous, try setting out_type to 'summary'

When I set out_type to "summary", I get this:

> test = classification(name2taxid(c(taxa$specie_ID), out_type="summary"))
Error in `dplyr::summarize()`:
ℹ In argument: `taxids = paste(.data$tax_id, collapse = "|")`.
ℹ In group 1: `name = "Morganella sp."`.
Caused by error in `.data$tax_id`:
! Column `tax_id` not found in `.data`.
Backtrace:

  1. taxizedb::classification(name2taxid(c(taxa$specie_ID), out_type = "summary"))
  2. rlang:::abort_data_pronoun(x, call = y)

Apparently Morganella sp. is not recognized by taxizedb. I'm not particularly familiar with dplyr or with taxize, so I would just like to know how I could retrieve the taxonomy for each of my bacterial species, preferably as a table with columns like this:

Specie_ID Kingdom Phylum Class Order Family Genus

@stitam
Collaborator

stitam commented Mar 1, 2023

Thanks @GossypiumH for raising this issue.

The issue is caused by taxons that can be linked with multiple taxids:

taxizedb::name2taxid("morganella", out_type = "summary")
#> # A tibble: 3 × 2
#>   name       id    
#>   <chr>      <chr> 
#> 1 morganella 581   
#> 2 morganella 90690 
#> 3 morganella 108061

Created on 2023-03-01 with reprex v2.0.2

A very small change to your approach should solve your issue: Run classification() on the id column of the name2taxid() output, not the whole object (maybe this is what you wanted to do in the first place, so it's just a typo thing?):

test = classification(name2taxid(c("morganella", "escherichia"), out_type = "summary")$id)
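For the table asked for above: classification() returns a list with one data frame per taxid (columns name, rank and id), so the result still has to be reshaped into one row per taxid. Below is a rough base R sketch of one way this could be done. The helper classification_to_wide() is a hypothetical name, not part of taxizedb, and the lineages are a hand-written mock so the example runs without a local database. The rank labels (e.g. "superkingdom") are an assumption and may differ depending on the NCBI database version you downloaded.

```r
# Hand-written mock of the structure classification() returns: a named list
# with one data frame per taxid. The real output also has an id column, which
# is omitted here because the helper below does not need it.
cls <- list(
  "581" = data.frame(
    name = c("Bacteria", "Pseudomonadota", "Gammaproteobacteria",
             "Enterobacterales", "Morganellaceae", "Morganella"),
    rank = c("superkingdom", "phylum", "class", "order", "family", "genus"),
    stringsAsFactors = FALSE
  ),
  "561" = data.frame(
    name = c("Bacteria", "Pseudomonadota", "Gammaproteobacteria",
             "Enterobacterales", "Enterobacteriaceae", "Escherichia"),
    rank = c("superkingdom", "phylum", "class", "order", "family", "genus"),
    stringsAsFactors = FALSE
  )
)

# Ranks to keep as columns (assumed labels, adjust to your database).
ranks <- c("superkingdom", "phylum", "class", "order", "family", "genus")

# Hypothetical helper: one row per taxid, one column per requested rank.
classification_to_wide <- function(cls, ranks) {
  rows <- lapply(names(cls), function(taxid) {
    lineage <- cls[[taxid]]
    if (is.data.frame(lineage)) {
      # match() gives NA for ranks missing from a lineage, so those cells stay NA
      values <- lineage$name[match(ranks, lineage$rank)]
    } else {
      # taxids without a lineage are returned as NA instead of a data frame
      values <- rep(NA_character_, length(ranks))
    }
    names(values) <- ranks
    data.frame(taxid = taxid, as.list(values), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

wide <- classification_to_wide(cls, ranks)
print(wide)
```

With a real database, cls would simply be the object returned by classification(), and the taxid column could then be joined back to your specie_ID names.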

However, taxons with multiple taxids will inflate the number of elements in your results, which can cause problems in your downstream analysis. Because of this, I would probably run name2taxid(out_type = "summary") first, resolve taxons with multiple taxids (investigate them manually, choose one and remove the rest from the tibble), and then run classification() on the data set with distinct taxons. I imagine there shouldn't be many taxons with multiple taxids.
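The resolve step could look roughly like the base R sketch below. It uses a mock of the summary output (columns name and id, as in the reprex above) so it runs without a database; the "escherichia" row and the choice of 581 for "morganella" are assumptions made purely for illustration, so check the lineage of each candidate before deciding.

```r
# Mock of name2taxid(..., out_type = "summary") output. The morganella rows
# come from the reprex above; the escherichia row is an assumed example.
ids <- data.frame(
  name = c("morganella", "morganella", "morganella", "escherichia"),
  id   = c("581", "90690", "108061", "561"),
  stringsAsFactors = FALSE
)

# Step 1: list the names that map to more than one taxid.
n_per_name <- table(ids$name)
ambiguous <- names(n_per_name)[as.vector(n_per_name > 1)]
print(ids[ids$name %in% ambiguous, ])

# Step 2: after inspecting the candidates manually, record one taxid per
# ambiguous name (hypothetical choice, for illustration only).
chosen <- c(morganella = "581")

# Step 3: keep unambiguous rows as they are, and only the chosen row otherwise.
# is_chosen is NA for names without a manual choice; TRUE | NA evaluates to TRUE.
is_chosen <- ids$id == unname(chosen[ids$name])
keep <- !(ids$name %in% names(chosen)) | is_chosen
resolved <- ids[keep, ]

# Every name should now map to exactly one taxid.
stopifnot(!anyDuplicated(resolved$name))
print(resolved)

# With a local database, the next step would be:
#   test <- taxizedb::classification(resolved$id)
```

Keeping the choices in a small named vector makes the manual decisions explicit and easy to review later.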

Do you think this approach could be feasible?

@GossypiumH
Author

Hello,

Thank you for your reply! I will try your solution. I hope I will not have too many taxa with multiple taxids.

Cheers,
