Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unranked targets when building custom DB #35

Open
donovan-h-parks opened this issue Jan 18, 2023 · 5 comments
Open

Unranked targets when building custom DB #35

donovan-h-parks opened this issue Jan 18, 2023 · 5 comments

Comments

@donovan-h-parks
Copy link

Hi,

I'm building a custom DB from a large set of genome files. I'm indicating the TaxonId of each sequence using the NCBI-style accession2taxid tab-separated files. When I build the DB, it appears some sequences are not being given a rank. Specifically, the output of metacache build indicates 262383 targets remain unranked. Does this mean that MetaCache identified and placed sequenced in the DB that do not have a taxonomic assignment (i.e. associated TaxonId)? Is there any way to determine which sequences remain unranked?

Thanks,
Donovan

>metacache build gtdb_r207_db ./gtdb_reps/genomes/ -taxonomy ./gtdb_taxonomy/R207 -reset-taxa -taxpostmap accn_to_taxid.tsv

Building new database 'gtdb_r207_db' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 401816 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version    2.2.3 (20220708)
database version     20200820
------------------------------------------------
sequence type        mc::char_sequence
target id type       unsigned int 32 bits
target limit         4294967295
------------------------------------------------
window id type       unsigned int 32 bits
window limit         4294967295
window length        127
window stride        112
------------------------------------------------
sketcher type        mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type         unsigned int 32 bits
feature hash         mc::same_size_hash<unsigned int>
kmer size            16
kmer limit           16
sketch size          16
------------------------------------------------
bucket size type     unsigned char 8 bits
max. locations       254
location limit       254
------------------------------------------------
Processing reference sequences.
Added 8302685 reference sequences in 14348.7 s
targets              8302685
ranked targets       0
taxa in tree         401816
------------------------------------------------
buckets              964032481
bucket size          max: 254 mean: 46.7784 +/- 61.2667 <> 2.0026
features             518555067
dead features        0
locations            24257165919
------------------------------------------------
8302685 targets are unranked.
Try to map sequences to taxa using 'accn_to_taxid.tsv' (381 MB)
262383 targets remain unranked.
Writing database to file ... Writing database metadata to file 'gtdb_r207_db.meta' ... done.
Writing database part to file 'gtdb_r207_db.cache0' ... done.
done.
@Funatiq
Copy link
Collaborator

Funatiq commented Jan 18, 2023

Does this mean that MetaCache identified and placed sequenced in the DB that do not have a taxonomic assignment (i.e. associated TaxonId)?

Yes, exactly.

Is there any way to determine which sequences remain unranked?

Unfortunately, there is no direct way. I think the best way would be to generate a list of all targets using

matacache info <database> lin

and check if the columns of the lineages are 0 (which means no valid TaxonId).

@donovan-h-parks
Copy link
Author

Hi,

Thanks for suggesting metacache info <database> lin. I've been able to use this to identify that the issue relates to FASTA files from NCBI that have accessions of a specific form, e.g. NZ_CAJRAF010000001.1. It appears MetaCache munges this name and records it as CAJRAF010000001.1. This appears to be related to the length of the accession as NZ_AUDU01000044.1 is retained as NZ_AUDU01000044.1.

Is this the expected operation of MetaCache and I should take care to remove any characters before an initial underscore from accessions >18 characters when providing a mapping file to -taxpostmap? This seems like a brittle rule for me to implement so was hoping you could provide some details on how MetaCache modifies accessions.

Thanks,
Donovan

@muellan
Copy link
Owner

muellan commented Jan 31, 2023

Hi,

accession numbers are a total mess. We use a regex to identify NCBI-style accession or accession.version sequence identifiers. For some reason that I don't remember we only allow the letter part to be 7 characters long (including the underscore).

If you want a super quick fix, go to the file "src/sequence_io.cpp" line 471 and replace the regex "(^|[^[:alnum:]])(([A-Z][_A-Z]{1,6}[0-9]{5,})(\\.[0-9]+)?)" with "(^|[^[:alnum:]])(([A-Z][_A-Z]{1,9}[0-9]{5,})(\\.[0-9]+)?)" (notice that the 6 got replaced with 9 which allows up to 10 letter characters at the start of the accession id) and recompile metacache.
That should solve your problem. I think we'll include such a change in the next release if it doesn't break anything.

André

@donovan-h-parks
Copy link
Author

Hi,

Thanks. I can look to implement the same regex expression to ensure consistency. Why is modifying accessions necessary? I'd like to add in my own genomes that don't necessarily have NCBI-style accessions. This is made a bit more complicated if I have to account for changes that might be made by MetaCache.

Thanks,
Donovan

@muellan
Copy link
Owner

muellan commented Mar 11, 2024

This should be solved by the changes in version v.2.4.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants