index doesn't work with a text file list of manifests #347

Closed
olgabot opened this issue Jun 7, 2024 · 4 comments · May be fixed by #354


Contributor

olgabot commented Jun 7, 2024

Hello, hope you are well!

I am very excited to try out the low-memory and fast searches created by RocksDB :) (Also, I will definitely be making use of pairwise!)

On my way there, I encountered some unexpected behavior. I had an enormous sequence file (e.g. UniRef50, 65M protein sequences) and cut it into chunks of 100k sequences so that sourmash scripts manysketch -p protein,scaled=1,k=10,abund could run without running out of resources.

Then, I wanted to index these many files before searching them, but sourmash scripts index didn't work on a list of manifest files.

Here's a minimal reproduction, using the data in src/python/tests/test-data:

# Make input csv files (printf rather than echo, since not every shell's echo expands \n)
printf 'name,genome_filename,protein_filename\nshort,short.fa,\n' > short.csv
printf 'name,genome_filename,protein_filename\nshort,short2.fa,\n' > short2.csv
printf 'name,genome_filename,protein_filename\nshort,short3.fa,\n' > short3.csv

# Make sketches
sourmash scripts manysketch short.csv -o short.fa.zip -p dna,k=31,scaled=1 
sourmash scripts manysketch short2.csv -o short2.fa.zip -p dna,k=31,scaled=1
sourmash scripts manysketch short3.csv -o short3.fa.zip -p dna,k=31,scaled=1

# Make list of sketches (but they're actually manifests?)
for ZIP in short*.zip; do echo "$ZIP"; done > short_siglist.txt

Then, sourmash scripts index fails:

$ sourmash scripts index --ksize 31 --scaled 1 -o short_index.rocksdb short_siglist.txt   

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

ksize: 31 / scaled: 1 / moltype: DNA 
indexing all sketches in 'short_siglist.txt'
Loading siglist
Reading signature(s) from: 'short_siglist.txt'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short2.fa.zip'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short.fa.zip'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path 'short3.fa.zip'
No valid signatures found in signature pathlist 'short_siglist.txt'
WARNING: 3 signature paths failed to load. See error messages above.
Error: Signatures failed to load. Exiting.
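
For anyone hitting the same error: "expected value at line 1 column 1" reads like a JSON parse error, which would be consistent with each listed path being opened as a plain JSON signature file rather than as a zip archive. That is an inference from the message, not something confirmed in this thread. A minimal sketch for checking what a path list actually contains is below; classify_siglist is a hypothetical helper name, not part of sourmash, and it only inspects the first two bytes of each file.

```shell
# classify_siglist FILE  (hypothetical helper, not a sourmash command)
# Prints "<kind><TAB><path>" for every non-empty line in FILE, where kind is:
#   missing - the path is not a regular file
#   zip     - the file starts with the zip magic bytes "PK"
#   other   - anything else (for example a plain JSON .sig file)
classify_siglist() {
    while IFS= read -r path || [ -n "$path" ]; do
        # Skip blank lines in the list.
        [ -z "$path" ] && continue
        if [ ! -f "$path" ]; then
            kind="missing"
        elif [ "$(head -c 2 "$path")" = "PK" ]; then
            kind="zip"
        else
            kind="other"
        fi
        printf '%s\t%s\n' "$kind" "$path"
    done < "$1"
}
```

Running classify_siglist short_siglist.txt against the list built above should report each entry as zip, since every file was written by manysketch with a .zip output name.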

I'm realizing now that the short*.zip files are manifests and not sigs, but I was confused that sourmash scripts index couldn't work with them, since all the parameters matched in the sourmash sig describe output:

$ sourmash sig describe short.fa.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /Users/olgabot/code/sourmash_plugin_branchwater/src/python/tests/test-data/short.fa.zip
signature: short
source file: short.fa
md5: 9191284a3a23a913d8d410f3d53ce8f0
k=31 molecule=DNA num=0 scaled=1 seed=42 track_abundance=0
size: 970
sum hashes: 970
signature license: CC0

loaded 1 signatures total, from 1 files

The workaround is using sourmash sig cat to combine the signatures into one file, but I was hoping not to do this until index creation since the input files are so big.

sourmash sig cat short*.zip -o combined_short.zip 
sourmash scripts index combined_short.zip --ksize 31 --scaled 1 -o short_index.rocksdb 

Let me know if I'm not thinking about this problem correctly and there's a better way to do it.

Hope this was informative! Thank you!

olgabot changed the title from "index doesn't work with multiple manifests" to "index doesn't work with a text file list of manifests" on Jun 7, 2024
Collaborator

ctb commented Jun 7, 2024

you are exactly right... they are not yet supported but rather desperately needed (see #266 and #235).

there are a few issues that are likely to take priority over upgrading this behavior - in particular, #322 and #331 are top of my mind right now - but your use case is really important functionality that we hope to implement soon.

Collaborator

ctb commented Jun 7, 2024

(and yes, I think the documentation is also broken around this behavior. To quote Napoleon, “You can ask me for anything you like, except time” 😭 )

Collaborator

ctb commented Jun 19, 2024

#364 "fixes" the documentation by commenting out the manifest CSV recommendations until we can support them.

Collaborator

ctb commented Oct 16, 2024

#430 is merged and released in v0.9.8, and this now works! Per the revised documentation for index, however, the sketches may all be loaded into memory when using index, which is suboptimal. That's for work in the future - being tracked in #415 and sourmash-bio/sourmash#3321.

ctb closed this as completed on Oct 16, 2024