Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable listing of zip files in fromfiles for index, gather, search #354

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

olgabot
Copy link
Contributor

@olgabot olgabot commented Jun 13, 2024

Fixes #347, #266, #235.

This PR adds the ability to do index|multigather|multisearch|pairwise|search, using fromfiles containing .zip files created by manysketch.

@ctb
Copy link
Collaborator

ctb commented Jun 13, 2024

maybe of interest - sourmash-bio/sourmash#3150 - please add notes and questions as you have them ;)

@olgabot
Copy link
Contributor Author

olgabot commented Jun 13, 2024

I'm pretty new to Rust so gonna post my attempts and compile errors so far, in case it's helpful. I feel like I'm repeating incantations that I don't fully understand, so hopefully this at least helps me figure out what's going on.

    let collection = collection.or_else(
        lines.par_iter().filter_map(|line| Some(PathBuf::from(line).extension() == "zip")) {
            match collection_from_zipfile(&PathBuf::from(line), &report_type) {
                Ok(coll) => Some((coll, 0)),
                Err(e) => {
                    last_error = Some(e);
                    None
                }
            }
            eprintln!("Matched zipfiles")
        }
    ); 

First compile error: expected one of ')', ',', '.', '?', or an operator, found '{'

error: expected one of `)`, `,`, `.`, `?`, or an operator, found `{`
   --> src/utils.rs:640:92
    |
640 |         lines.par_iter().filter_map(|line| Some(PathBuf::from(line).extension() == "zip")) {
    |                                                                                           -^ expected one of `)`, `,`, `.`, `?`, or an operator
    |                                                                                           |
    |                                                                                           help: missing `,`

The first one I don't understand, because it seems like all the parens match up. I was mixing these two lines:

let collection = if sigpath.extension().map_or(false, |ext| ext == "zip") {

.filter_map(|path| match Signature::from_path(path) {

Second compile error: error[E0599]: expected value, found macro 'line'

error[E0423]: expected value, found macro `line`
   --> src/utils.rs:641:58
    |
641 |             match collection_from_zipfile(&PathBuf::from(line), &report_type) {
    |                                                          ^^^^
    |                                                          |
    |                                                          not a value
    |                                                          help: a local variable with a similar name exists: `lines`

This is me not knowing Rust. How can the path be initialized from the line in filter_map, to be tested for both the extension and then be referenced with &?

Third compile error: error[E0599]: no method named 'or_else' found for struct 'Collection' in the current scope

error[E0599]: no method named `or_else` found for struct `Collection` in the current scope
   --> src/utils.rs:639:33
    |
639 |     let collection = collection.or_else(
    |                      -----------^^^^^^^ method not found in `Collection`

This one I don't understand at all, because collection_from_signature and load_collection door_else all over the place! Does some other thing need to be set for collection ?

collection.or_else(|| match collection_from_manifest(&sigpath, &report_type) {

Fourth compile error: error[E0277]: can't compare 'std::option::Option<&str>' with '&str'

error[E0277]: can't compare `std::option::Option<&str>` with `&str`
   --> src/utils.rs:640:81
    |
640 |         lines.par_iter().filter_map(|line| Some(PathBuf::from(line).extension() == "zip")) {
    |                                                                                 ^^ no implementation for `std::option::Option<&str> == &str`
    |
    = help: the trait `PartialEq<&str>` is not implemented for `std::option::Option<&str>`
    = help: the following other types implement trait `PartialEq<Rhs>`:
              <std::option::Option<Box<U>> as PartialEq<rkyv::niche::option_box::ArchivedOptionBox<T>>>
              <std::option::Option<T> as PartialEq>
              <std::option::Option<U> as PartialEq<rkyv::option::ArchivedOption<T>>>

I'm not clear on how to get the extension of the file specified in the lines. I thought PathBuf.from(line).extension() would create a string that could be compared, like this:

let collection = if sigpath.extension().map_or(false, |ext| ext == "zip") {

Fifth compile error: error[E0308]: mismatched types

error[E0308]: mismatched types
   --> src/utils.rs:645:21
    |
641 | /             match collection_from_zipfile(&PathBuf::from(line), &report_type) {
642 | |                 Ok(coll) => Some((coll, 0)),
643 | |                 Err(e) => {
644 | |                     last_error = Some(e);
645 | |                     None
    | |                     ^^^^ expected `()`, found `Option<_>`
646 | |                 }
647 | |             }
    | |_____________- expected this to be `()`
    |
    = note: expected unit type `()`
                    found enum `std::option::Option<_>`

At this point, I'm lucky to even get into this loop, so I'm not too worried about this yet :)

EDIT: formatting with backticks to make look a bit nicer

@olgabot
Copy link
Contributor Author

olgabot commented Jun 21, 2024

Okay, I think this is closer. I created a collection_from_zipfile_or_signature_or_manifest function that is called within collection_from_pathlist:

fn collection_from_zipfile_or_signature_or_manifest(
    sigpath: &Path,
    report_type: &ReportType,
) -> Result<Option<Collection>, Option<anyhow::Error>> {

Now, the only compiler error is a Collection vs Vec<Collection> issue:

error[E0308]: mismatched types
   --> src/utils.rs:654:9
    |
654 |     Ok((collection, n_failed))
    |         ^^^^^^^^^^ expected `Collection`, found `Vec<Collection>`
    |
    = note: expected struct `Collection`
               found struct `Vec<Collection>`

I'm not fully understanding how to manage this, since I thought the .flatten().collect() would take care of this, but line 650 is actually .flatten().collect::<Vec<Collection>>(). Each entry in pathlist then creates its own Collection, so the output is a vector of Collections. Is there a way to create a Collection from <Vec<Collection>>?

@olgabot
Copy link
Contributor Author

olgabot commented Jun 21, 2024

Hmm from looking in sourmash/src/core/src/collection.rs, should collection_from_pathlist in this repo be using Collection::from_paths from sourmash?

https://github.com/sourmash-bio/sourmash/blob/8a60b8cc47007ca8b05a63b1367bb6c7f2982821/src/core/src/collection.rs#L149

src/utils.rs Outdated
Comment on lines 645 to 657
.filter_map(|path| match collection_from_zipfile_or_signature_or_manifest(&path, &report_type) {
Ok(collection) => {
// For each record in the collection, get its path filename
Some(collection
.unwrap()
.manifest()
.iter()
.filter_map(|record| match record.internal_location(){
PathBuf => Some(location)

})
.collect())
},
Copy link
Contributor Author

@olgabot olgabot Jun 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting compiler error that Some doesn't exist? Not sure what to make of this..

error[E0425]: cannot find value `location` in this scope
   --> src/utils.rs:653:41
    |
653 |                         PathBuf => Some(location)
    |                                         ^^^^^^^^ not found in this scope

src/utils.rs Outdated
Comment on lines 645 to 655
.filter_map(|path| match collection_from_zipfile_or_signature_or_manifest(&path, &report_type) {
Ok(collection) => {
// For each record in the collection, get its path filename
Some(collection
.unwrap()
.manifest()
.iter()
.map(|record| record.internal_location())
.collect())
},
Err(err) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried using map here, but then the type couldn't be inferred:

error[E0283]: type annotations needed
   --> src/utils.rs:653:22
    |
645 |         .filter_map(|path| match collection_from_zipfile_or_signature_or_manifest(&path, &report_type) {
    |          ---------- type must be known at this point
...
653 |                     .collect())
    |                      ^^^^^^^ cannot infer type of the type parameter `B` declared on the method `collect`
    |
    = note: cannot satisfy `_: Send`
note: required by a bound in `rayon::iter::ParallelIterator::filter_map`
   --> /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/mod.rs:840:12
    |
837 |     fn filter_map<P, R>(self, filter_op: P) -> FilterMap<Self, P>
    |        ---------- required by a bound in this associated function
...
840 |         R: Send,
    |            ^^^^ required by this bound in `ParallelIterator::filter_map`
help: consider specifying the generic argument
    |
653 |                     .collect::<Vec<_>>())
    |                             ++++++++++

@olgabot
Copy link
Contributor Author

olgabot commented Jun 23, 2024

I learned that Collection::from_paths doesn't work with .zip files because once I fixed my compile errors, the actual Python test errored out.

It seems that Collection holds Record objects but not Signature objects. However, I need Signatures or PathBuf to create a new Collection, as Collection has these constructors: Collection::from_zipfile, Collection::from_sigs, Collection::from_paths. Turns out that relationship between Record and Signature is not reciprocal, as Record::from_sig exists, but no Record::to_sig method exists. But, Record::internal_location() exists and returns a PathBuf, so my plan is to use that.

My plan now is to:

  1. For each line in the manifest, create a collection -> output: Vec<Collection>
  2. For each collection, iterate over every record in the manifest, and get record.internal_location(), then collect -> output: Vec<Vec<PathBuf>>
  3. Flatten and collect the list of pathbufs -> output: Vec<PathBuf>
  4. Use Collection::from_paths to create a new collection from Vec<PathBuf> -> Collection object, yay!

Let me know if this is on the right track. The code is doing some double nesting things right now which is probably not the best, so happy to change to something more canonical Rust.

I tried both with map #354 (comment) (where the types couldn't be inferred, so I went to filter_map instead) and filter_map #354 (comment) (where somehow the Some keyword didn't work anymore?). So I'm feeling good about progress, that maybe this could be a direction, but still need to figure out the vector mapping logic.

@olgabot olgabot marked this pull request as ready for review June 30, 2024 23:11
@olgabot
Copy link
Contributor Author

olgabot commented Jun 30, 2024

So, this seems to be working! (maybe?) Tests are passing:

maturin develop && pytest -v -k test_index_multiple_manifests --pdb results in :

📦 Built wheel for CPython 3.12 to /var/folders/7z/r1593ybs1sj2ks5zzl9vy8840000gn/T/.tmpW3hV1Q/sourmash_plugin_branchwater-0.9.6.dev0-cp312-cp312-macosx_10_14_x86_64.whl
✏️  Setting installed package as editable
🛠 Installed sourmash_plugin_branchwater-0.9.6.dev0
==================================================================== test session starts =====================================================================
platform darwin -- Python 3.12.3, pytest-8.2.1, pluggy-1.5.0 -- /Users/olgabot/anaconda3/envs/branchwater-dev/bin/python3.12
cachedir: .pytest_cache
rootdir: /Users/olgabot/code/sourmash_plugin_branchwater
configfile: pyproject.toml
collected 261 items / 260 deselected / 1 selected

src/python/tests/test_index.py::test_index_multiple_manifests PASSED                                                                                   [100%]

====================================================================== warnings summary ======================================================================
src/python/tests/sourmash_tst_utils.py:10
  /Users/olgabot/code/sourmash_plugin_branchwater/src/python/tests/sourmash_tst_utils.py:10: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================== 1 passed, 260 deselected, 1 warning in 7.36s =======================================================

To inspect the output, added assert False in the test, and was able to see this created directory:

(base)
 Sun 30 Jun - 16:05  /var/folders/7z/r1593ybs1sj2ks5zzl9vy8840000gn/T/sourmashtest_22dlbn8t 
 @olgabot  ll
Permissions Size User    Group Date Modified Name
.rw-r--r--   23k olgabot staff 30 Jun 16:05  2.fa.sig.gz.mf.zip
.rw-r--r--   43k olgabot staff 30 Jun 16:05  47.fa.sig.gz.mf.zip
.rw-r--r--   44k olgabot staff 30 Jun 16:05  63.fa.sig.gz.mf.zip
.rw-r--r--   272 olgabot staff 30 Jun 16:05  manifest_zips.txt
drwxr-xr-x     - olgabot staff 30 Jun 16:05  out.db

out.db is a directory which looks like this

(base)
 Sun 30 Jun - 16:05  /var/folders/7z/r1593ybs1sj2ks5zzl9vy8840000gn/T/sourmashtest_22dlbn8t 
 @olgabot  ll out.db
Permissions Size User    Group Date Modified Name
.rw-r--r--     0 olgabot staff 30 Jun 16:05  000012.log
.rw-r--r--  175k olgabot staff 30 Jun 16:05  000013.sst
.rw-r--r--  1.7k olgabot staff 30 Jun 16:05  000014.sst
.rw-r--r--    16 olgabot staff 30 Jun 16:05  CURRENT
.rw-r--r--    36 olgabot staff 30 Jun 16:05  IDENTITY
.rw-r--r--     0 olgabot staff 30 Jun 16:05  LOCK
.rw-r--r--   60k olgabot staff 30 Jun 16:05  LOG
.rw-r--r--   642 olgabot staff 30 Jun 16:05  MANIFEST-000005
.rw-r--r--   15k olgabot staff 30 Jun 16:05  OPTIONS-000009
.rw-r--r--   15k olgabot staff 30 Jun 16:05  OPTIONS-000011

But unclear if this output folder is the correct out.db that will work with other sourmash commands.

To try using the created database, I did sourmash scripts fastmultigather 63.fa.sig.gz.mf.zip out.db (one of the input scripts), but ran into an error:

(branchwater-dev)
 Sun 30 Jun - 16:08  /var/folders/7z/r1593ybs1sj2ks5zzl9vy8840000gn/T/sourmashtest_22dlbn8t 
 @olgabot  sourmash scripts fastmultigather 63.fa.sig.gz.mf.zip out.db

== This is sourmash version 4.8.9. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.9.6.dev0; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in '63.fa.sig.gz.mf.zip' against 'out.db' using 16 threads
Loaded DB
Reading query(s) from: '63.fa.sig.gz.mf.zip'
Loaded 1 query signature(s)
thread '<unnamed>' panicked at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/sourmash-0.14.1/src/storage.rs:628:48:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/olgabot/anaconda3/envs/branchwater-dev/bin/sourmash", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/olgabot/anaconda3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/__main__.py", line 20, in main
    retval = mainmethod(args)
             ^^^^^^^^^^^^^^^^
  File "/Users/olgabot/code/sourmash_plugin_branchwater/src/python/sourmash_plugin_branchwater/__init__.py", line 163, in main
    status = sourmash_plugin_branchwater.do_fastmultigather(args.query_paths,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

With the backtrace,

Full `sourmash scripts fastmultigather` output with backtrace
(branchwater-dev)
 ✘  Sun 30 Jun - 16:08  /var/folders/7z/r1593ybs1sj2ks5zzl9vy8840000gn/T/sourmashtest_22dlbn8t 
 @olgabot  export RUST_BACKTRACE=full RUST_LOG=trace
(branchwater-dev)
 Sun 30 Jun - 16:14  /var/folders/7z/r1593ybs1sj2ks5zzl9vy8840000gn/T/sourmashtest_22dlbn8t 
 @olgabot  sourmash scripts fastmultigather 63.fa.sig.gz.mf.zip out.db

== This is sourmash version 4.8.9. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.9.6.dev0; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in '63.fa.sig.gz.mf.zip' against 'out.db' using 16 threads
Loaded DB
Reading query(s) from: '63.fa.sig.gz.mf.zip'
Loaded 1 query signature(s)
thread '<unnamed>' panicked at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/sourmash-0.14.1/src/storage.rs:628:48:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0:        0x155aa9e65 - std::backtrace_rs::backtrace::libunwind::trace::h6f224ea69532b9c4
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
   1:        0x155aa9e65 - std::backtrace_rs::backtrace::trace_unsynchronized::h7e173ef23e4514eb
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x155aa9e65 - std::sys_common::backtrace::_print_fmt::hb68789b79179df75
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys_common/backtrace.rs:68:5
   3:        0x155aa9e65 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h46bc8e86d021398c
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys_common/backtrace.rs:44:22
   4:        0x155ac8fc3 - core::fmt::rt::Argument::fmt::hf606bd67a54daa54
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/fmt/rt.rs:142:9
   5:        0x155ac8fc3 - core::fmt::write::h1c24e945c7ef8357
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/fmt/mod.rs:1120:17
   6:        0x155aa77ae - std::io::Write::write_fmt::hd9fe01a40900d808
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/io/mod.rs:1846:15
   7:        0x155aa9c39 - std::sys_common::backtrace::_print::h3bfcf7a1dfb9a746
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys_common/backtrace.rs:47:5
   8:        0x155aa9c39 - std::sys_common::backtrace::print::hdbfcabb7d363540a
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys_common/backtrace.rs:34:9
   9:        0x155aab4c5 - std::panicking::default_hook::{{closure}}::hecb24382dfeeea98
  10:        0x155aab23e - std::panicking::default_hook::h78dca947ab9e89eb
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:292:9
  11:        0x155aab933 - std::panicking::rust_panic_with_hook::hb83cee65df04957a
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:781:13
  12:        0x155aab80c - std::panicking::begin_panic_handler::{{closure}}::hbe6e44d88f82bc77
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:651:13
  13:        0x155aaa359 - std::sys_common::backtrace::__rust_end_short_backtrace::h6560cb9c1f1224c1
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/sys_common/backtrace.rs:171:18
  14:        0x155aab586 - rust_begin_unwind
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:647:5
  15:        0x155aef125 - core::panicking::panic_fmt::h2aac8cf45f7ae617
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/panicking.rs:72:14
  16:        0x155aef1f7 - core::panicking::panic::h4a7e975994ee54c1
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/panicking.rs:144:5
  17:        0x155aef088 - core::option::unwrap_failed::h46e1b950e2bec24f
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/option.rs:1978:5
  18:        0x15541d51c - core::option::Option<T>::unwrap::ha844df2bb8f7c47f
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/option.rs:931:21
  19:        0x15541d51c - <sourmash::storage::MemStorage as sourmash::storage::Storage>::load_sig::h91196c07e71c948f
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/sourmash-0.14.1/src/storage.rs:628:12
  20:        0x1553ea261 - <std::sync::rwlock::RwLock<L> as sourmash::storage::Storage>::load_sig::hae08c7c075c3a91f
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/sourmash-0.14.1/src/storage.rs:215:9
  21:        0x155419dc6 - <sourmash::storage::InnerStorage as sourmash::storage::Storage>::load_sig::h5fdb003b953dd3b8
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/sourmash-0.14.1/src/storage.rs:171:25
  22:        0x1553c6c9b - sourmash::collection::Collection::sig_for_dataset::h9b931ea66e8df55a
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/sourmash-0.14.1/src/collection.rs:178:19
  23:        0x15537b90f - <sourmash::index::revindex::disk_revindex::RevIndex as sourmash::index::revindex::RevIndexOps>::gather::hab3b73cd197c9e6a
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/sourmash-0.14.1/src/index/revindex/disk_revindex.rs:339:29
  24:        0x155167e81 - <sourmash::index::revindex::RevIndex as sourmash::index::revindex::RevIndexOps>::gather::h9a74d626905065be
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/sourmash-0.14.1/src/index/revindex/mod.rs:50:1
  25:        0x1551647b7 - sourmash_plugin_branchwater::mastiff_manygather::mastiff_manygather::{{closure}}::h73189abd155ec8b5
                               at /Users/olgabot/code/sourmash_plugin_branchwater/src/mastiff_manygather.rs:70:39
  26:        0x1551af37b - <rayon::iter::filter_map::FilterMapFolder<C,P> as rayon::iter::plumbing::Folder<T>>::consume::he68eca72d1c01790
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/filter_map.rs:123:36
  27:        0x1551b3186 - rayon::iter::plumbing::Folder::consume_iter::h7aeb3ccd5b6f1a32
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/plumbing/mod.rs:178:20
  28:        0x1551884ad - <rayon::iter::map::MapFolder<C,F> as rayon::iter::plumbing::Folder<T>>::consume_iter::ha7b9c9287517aaa6
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/map.rs:248:21
  29:        0x15510afdd - rayon::iter::plumbing::Producer::fold_with::h15711142466a1e6c
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/plumbing/mod.rs:109:9
  30:        0x1551bb435 - rayon::iter::plumbing::bridge_producer_consumer::helper::h70e96589b0d62aac
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/plumbing/mod.rs:437:13
  31:        0x1551b6df1 - rayon::iter::plumbing::bridge_producer_consumer::h833f1810d0d01c85
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/plumbing/mod.rs:396:12
  32:        0x1551b5bef - <rayon::iter::plumbing::bridge::Callback<C> as rayon::iter::plumbing::ProducerCallback<I>>::callback::h80010e9b057f1ad1
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/plumbing/mod.rs:372:13
  33:        0x15510ad4a - <<rayon::iter::enumerate::Enumerate<I> as rayon::iter::IndexedParallelIterator>::with_producer::Callback<CB> as rayon::iter::plumbing::ProducerCallback<I>>::callback::h64c28e23065c8fab
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/enumerate.rs:78:17
  34:        0x1551ace5f - <rayon::slice::Iter<T> as rayon::iter::IndexedParallelIterator>::with_producer::hfee0d6b876cd6531
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/slice/mod.rs:826:9
  35:        0x15510b891 - <rayon::iter::enumerate::Enumerate<I> as rayon::iter::IndexedParallelIterator>::with_producer::h5f3cb5d218edf374
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/enumerate.rs:62:16
  36:        0x1551c6f3d - rayon::iter::plumbing::bridge::hfa21054829fe1a61
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/plumbing/mod.rs:356:12
  37:        0x15510b5dc - <rayon::iter::enumerate::Enumerate<I> as rayon::iter::ParallelIterator>::drive_unindexed::h764ef738ae0ba75b
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/enumerate.rs:38:9
  38:        0x155187eec - <rayon::iter::map::Map<I,F> as rayon::iter::ParallelIterator>::drive_unindexed::h70798aec081ff73d
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/map.rs:49:9
  39:        0x1551b46d8 - <rayon::iter::filter_map::FilterMap<I,P> as rayon::iter::ParallelIterator>::drive_unindexed::h912e986c1f764d91
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/filter_map.rs:46:9
  40:        0x1550c16e5 - <rayon::iter::flatten::Flatten<I> as rayon::iter::ParallelIterator>::drive_unindexed::h00791cf5f0798cce
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/flatten.rs:38:9
  41:        0x1550ac454 - <rayon::iter::map_with::MapWith<I,T,F> as rayon::iter::ParallelIterator>::drive_unindexed::hcafba9497f179ebc
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/map_with.rs:53:9
  42:        0x1550f6acf - rayon::iter::try_reduce::try_reduce::h7a68390d0648840f
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/try_reduce.rs:21:5
  43:        0x1550abd64 - rayon::iter::ParallelIterator::try_reduce::h7aab82da02d0d392
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/mod.rs:1087:9
  44:        0x1550c11c1 - rayon::iter::ParallelIterator::try_for_each_with::h568bd72808bd6d49
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-1.10.0/src/iter/mod.rs:523:9
  45:        0x1551f914e - sourmash_plugin_branchwater::mastiff_manygather::mastiff_manygather::h9fdf9888d2a86984
                               at /Users/olgabot/code/sourmash_plugin_branchwater/src/mastiff_manygather.rs:54:16
  46:        0x15517eafe - sourmash_plugin_branchwater::do_fastmultigather::h814b200ad22dec20
                               at /Users/olgabot/code/sourmash_plugin_branchwater/src/lib.rs:122:15
  47:        0x15517f93a - sourmash_plugin_branchwater::__pyfunction_do_fastmultigather::hb64ce83255049474
                               at /Users/olgabot/code/sourmash_plugin_branchwater/src/lib.rs:106:1
  48:        0x1551942d6 - pyo3::impl_::trampoline::fastcall_with_keywords::{{closure}}::h0f8ff0dc78ec7ff5
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.21.2/src/impl_/trampoline.rs:45:29
  49:        0x15519422b - pyo3::impl_::trampoline::trampoline::{{closure}}::hf36748a7cd33a7a3
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.21.2/src/impl_/trampoline.rs:187:54
  50:        0x1550e7ff6 - std::panicking::try::do_call::h46969c12a76f2fc1
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:554:40
  51:        0x1550ed8cd - ___rust_try
  52:        0x1550e37b7 - std::panicking::try::h7589864a77e3035a
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:518:19
  53:        0x1550a1664 - std::panic::catch_unwind::h9ba174ca62996f50
                               at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panic.rs:142:14
  54:        0x155194151 - pyo3::impl_::trampoline::trampoline::hd3b63f930fcf6ac6
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.21.2/src/impl_/trampoline.rs:187:9
  55:        0x15509732d - pyo3::impl_::trampoline::fastcall_with_keywords::hfaee83008e4857ad
                               at /Users/olgabot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.21.2/src/impl_/trampoline.rs:45:13
  56:        0x15517efe4 - sourmash_plugin_branchwater::<impl sourmash_plugin_branchwater::do_fastmultigather::MakeDef>::_PYO3_DEF::trampoline::h1b0e605360433e6a
                               at /Users/olgabot/code/sourmash_plugin_branchwater/src/lib.rs:106:1
  57:        0x100d863e6 - _cfunction_vectorcall_FASTCALL_KEYWORDS
  58:        0x100e8d00b - __PyEval_EvalFrameDefault
  59:        0x100e4a1ee - _PyEval_EvalCode
  60:        0x100f121a7 - _run_mod
  61:        0x100f11f75 - _pyrun_file
  62:        0x100f119ff - __PyRun_SimpleFileObject
  63:        0x100f11364 - __PyRun_AnyFileObject
  64:        0x100f4064b - _pymain_run_file_obj
  65:        0x100f3ff19 - _pymain_run_file
  66:        0x100f3f78c - _Py_RunMain
  67:        0x100ca23a8 - _main
  68:     0x7ff809ebe366 - <unknown>
Traceback (most recent call last):
  File "/Users/olgabot/anaconda3/envs/branchwater-dev/bin/sourmash", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/olgabot/anaconda3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/__main__.py", line 20, in main
    retval = mainmethod(args)
             ^^^^^^^^^^^^^^^^
  File "/Users/olgabot/code/sourmash_plugin_branchwater/src/python/sourmash_plugin_branchwater/__init__.py", line 163, in main
    status = sourmash_plugin_branchwater.do_fastmultigather(args.query_paths,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

However, it seems that manysearch works:

Hmmm, but `sourmash scripts manysearch` works:

```rust
(branchwater-dev)
 ✘  Sun 30 Jun - 16:16  /var/folders/7z/r1593ybs1sj2ks5zzl9vy8840000gn/T/sourmashtest_22dlbn8t 
 @olgabot  sourmash scripts manysearch -o manysearch.txt 63.fa.sig.gz.mf.zip out.db

== This is sourmash version 4.8.9. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

=> sourmash_plugin_branchwater 0.9.6.dev0; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: 1000 / moltype: DNA / threshold: 0.01
searching all sketches in '63.fa.sig.gz.mf.zip' against 'out.db' using 16 threads
selection scaled: Some(1000)
Loaded DB
Reading query(s) from: '63.fa.sig.gz.mf.zip'
Loaded 1 query signature(s)
DONE. Processed 1 search sigs
...manysearch is done! results in 'manysearch.txt'
(branchwater-dev)
 Sun 30 Jun - 16:16  /var/folders/7z/r1593ybs1sj2ks5zzl9vy8840000gn/T/sourmashtest_22dlbn8t 
 @olgabot  cat manysearch.txt
query_name,query_md5,match_name,containment,intersect_hashes,match_md5,jaccard,max_containment,query_containment_ani
"NC_011665.1 Shewanella baltica OS223 plasmid pS22303, complete sequence",38729c6374925585db28916b82a6f513,"NC_011665.1 Shewanella baltica OS223 plasmid pS22303, complete sequence",1.0,5238,,,,1.0
"NC_011665.1 Shewanella baltica OS223 plasmid pS22303, complete sequence",38729c6374925585db28916b82a6f513,"NC_009661.1 Shewanella baltica OS185 plasmid pS18501, complete sequence",0.48281786941580757,2529,,,,0.9767860811200507

Thoughts? Happy to hear any Rust suggestions as I copied some internal code from SigStore::data which is a private method and got pretty confused on all the paths vs strings vs os_string things.

Let me know what you think!

@ctb
Copy link
Collaborator

ctb commented Jun 30, 2024

Hi @olgabot I can't quite tell what you're trying to do in this PR 😅

The code in test_index_multiple_manifests is creating zip files, not manifests - each .mf.zip file is a zip file containing one sketch, right?

% sourmash sig cat src/python/tests/test-data/63.fa.sig.gz -o xyz.mf.zip
...
% unzip -v xyz.mf.zip
Archive:  xyz.mf.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
   43584  Stored    43584   0% 06-30-2024 16:38 903b1180  signatures/38729c6374925585db28916b82a6f513.sig.gz
     337  Defl:N      253  25% 06-30-2024 16:38 b1f33f83  SOURMASH-MANIFEST.csv
--------          -------  ---                            -------
   43921            43837   0%                            2 files

By contrast, over in #235,

% sourmash sig collect src/python/tests/test-data/63.fa.sig.gz -o mf.csv -F csv  
...
% cat mf.csv
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
src/python/tests/test-data/63.fa.sig.gz,38729c6374925585db28916b82a6f513,38729c63,31,DNA,0,1000,5238,False,"NC_011665.1 Shewanella baltica OS223 plasmid pS22303, complete sequence",podar-ref/63.fa

creates a standalone manifest CSV file - a CSV file that contains a pointer to sketches in another file.

Revisiting your test, you're essentially doing this, right?

sourmash sig cat src/python/tests/test-data/63.fa.sig.gz -o 63.sig.zip
sourmash sig cat src/python/tests/test-data/47.fa.sig.gz -o 47.sig.zip
ls -1 47.sig.zip 63.sig.zip > list.txt

and then running things like manysearch on list.txt.

This also doesn't currently work (except maybe in this branch!) and it's great to get it working, but it's different from manifests.

@ctb
Copy link
Collaborator

ctb commented Jun 30, 2024

when I run manysearch on list.txt in the main branch, I get:

% sourmash scripts manysearch list.txt list.txt -o xyz.csv

== This is sourmash version 4.8.11.dev0. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

=> sourmash_plugin_branchwater 0.9.6.dev0; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: 1000 / moltype: DNA / threshold: 0.01
searching all sketches in 'list.txt' against 'list.txt' using 8 threads
selection scaled: Some(1000)
Reading query(s) from: 'list.txt'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path '47.sig.zip'
Sketch loading error: expected value at line 1 column 1
WARNING: could not load sketches from path '63.sig.zip'
No valid signatures found in query pathlist 'list.txt'
WARNING: 2 query paths failed to load. See error messages above.
Error: No query signatures loaded, exiting.

If that's what you want to get working, then a better PR title here would be "enable listing of zip files in fromfiles".

@ctb
Copy link
Collaborator

ctb commented Jun 30, 2024

(and I verify that under this PR, the above command works fine 🎉 )

@olgabot
Copy link
Contributor Author

olgabot commented Jul 1, 2024

Hi @olgabot I can't quite tell what you're trying to do in this PR 😅

Yes, sorry, I got confused by manifests vs zip files vs signatures here 😅 . The goal of this PR is to allow for sourmash scripts index to work on a pathlist.txt containing zip files. The suggested title of "enable listing of zip files in fromfiles" doesn't make sense to me because the main affected method is collection_from_pathlist called by load_collection, and I don't know what "fromfiles" means, sorry. What about "Allow lists of zip files to work in collection_from_pathlist"?

@olgabot
Copy link
Contributor Author

olgabot commented Jul 1, 2024

Whelp, I spoke too soon ... tried running the full test suite locally and even the first few test scripts had a ton of errors:

(branchwater-dev)
 ✘  Sun 30 Jun - 17:35~/code/sourmash_plugin_branchwater   originolgabot/index-multiple-manifests 1● 
 @olgabotmake test
python -m pytest
==================================================================== test session starts =====================================================================
platform darwin -- Python 3.12.3, pytest-8.2.1, pluggy-1.5.0
rootdir: /Users/olgabot/code/sourmash_plugin_branchwater
configfile: pyproject.toml
collected 261 items

src/python/tests/test_cluster.py ..........FFFF..                                                                                                      [  6%]
src/python/tests/test_gather.py .F.F...FF...FF.F...F.F.F.F.F....F.FF                                                                                   [ 19%]
src/python/tests/test_index.py .F......FF.F....FF                                                                                                      [ 26%]
src/python/tests/test_multigather.py .FFFFFFFF..FFFF.FFFFFFFFF.FF.FFFFFFFFFF.FF......FF                                                                [ 45%]
src/python/tests/test_multisearch.py .FFF.FFF.FFF....FF.FFF

More work to be done!

@ctb
Copy link
Collaborator

ctb commented Jul 1, 2024

Hi @olgabot I can't quite tell what you're trying to do in this PR 😅

Yes, sorry, I got confused by manifests vs zip files vs signatures here 😅 . The goal of this PR is to allow for sourmash scripts index to work on a pathlist.txt containing zip files. The suggested title of "enable listing of zip files in fromfiles" doesn't make sense to me because the main affected method is collection_from_pathlist called by load_collection, and I don't know what "fromfiles" means, sorry. What about "Allow lists of zip files to work in collection_from_pathlist"?

heh, I almost had suggested pathlist.

We (I) have been inconsistent, it seems - the branchwater plugin documentation uses fromfile (e.g. link), while sourmash uses pathlist (e.g. see refactoring here).

So anyway two comments -

  • pathlists are suboptimal, because it requires two "reads" for each file - first, reading to get a manifest, and then again to actually load the sketch. Benchmarks show it's not that big a deal in the end, at least not for small files.
  • I would generally suggest trying to name PRs that improve command-line functionality after whatever terminology we use in the docs; this is because those titles end up in the release docs and presumably someone might actually read those. In practice this probably doesn't matter. But that's why I suggested 'fromfile' rather than basing things on the internal names. Given the sourmash inconsistencies above I'd suggest actually fixing the docs here to use pathlist but that's not really your problem to deal with!

tl;dr it would be good to fix the PR name/description somehow.

On a side note -

This issue is also worth skimming for @bluegenes discussion of how the code works: #264

@olgabot olgabot changed the title Index manifest zip files Enable listing of zip files in fromfiles for index, gather, search Jul 3, 2024
@olgabot
Copy link
Contributor Author

olgabot commented Jul 3, 2024

Hello,

Hope you are doing well! I was able to get my ONE test to work, but then it broke all the other tests. Let me share what I’ve tried so far.

The changes in this PR:

  • Reads a pathlist.txt aka fromfiles to create a Collection from each entry, which could be any of:
    • .sig
    • .sig.gz
    • .zip
    • manifest.csv
  • Iterates on the Collection with par_iter'd on to create (Idx, &Record) objects
  • Creates SigStore objects from Record objects
    • Record objects can be used by Collection.sig_from_record to create a SigStore
  • The key issue is to convert SigStore to a Signature, because Vec<Signature> is the type that is allowed to be put into Collection.from_sigs

How to convert SigStore -> Signature? --> I am struggling to convert SigStore to Signature

Tried dereferencing, but Copy trait not implemented on Signature

From the documentation, dereferencing SigStore creates a Signature :

image

However, *sigstore doesn't work because Copy is not implemented for Signature:

error[E0507]: cannot move out of dereference of `SigStore`
   --> src/utils.rs:692:27
    |
692 |                 let sig = *sigstore;
    |                           ^^^^^^^^^ move occurs because value has type `Signature`, which does not implement the `Copy` trait

Tried sigstore.data.get() but .data is a private field

I saw that SigStore.data -> Signature and said yay! But then it's a private field and I can't access it:

error[E0616]: field `data` of struct `SigStore` is private
   --> src/utils.rs:693:45
    |
693 |                 if let Some(sig) = sigstore.data.get() {
    |                                             ^^^^ private field

Works for my case: Copied the code in SigStore.data to create a signature, but Collection.load is not implemented for all Storage types

What I ended up doing was copying the SigStore.data source code to use the collection.storage() to load a signature:

let raw = collection
    .storage()
    .load(
        &<PathBuf as Clone>::clone(&record.internal_location())
            .into_os_string()
            .into_string()
            .unwrap(),
    )
    .unwrap();
let sig = Signature::from_reader(&mut &raw[..])
    .unwrap()
    // TODO: select the right sig?
    .swap_remove(0);
Some(sig)

However, this causes test errors:

(branchwater-dev)
 Wed  3 Jul - 07:50~/code/sourmash_plugin_branchwater   originolgabot/index-multiple-manifests ✔ 
 @olgabotpytest -k index
==================================================================== test session starts =====================================================================
platform darwin -- Python 3.12.3, pytest-8.2.1, pluggy-1.5.0
rootdir: /Users/olgabot/code/sourmash_plugin_branchwater
configfile: pyproject.toml
collected 261 items / 226 deselected / 35 selected

src/python/tests/test_gather.py F                                                                                                                      [  2%]
src/python/tests/test_index.py .F......FF.F....FF                                                                                                      [ 54%]
src/python/tests/test_multigather.py FFFFFF...FF                                                                                                       [ 85%]
src/python/tests/test_search.py FF...                                                                                                                  [100%]

========================================================================== FAILURES =========================================================================

Looking into the errors, certain kinds of Storage don’t have load, but they do have load_sig, which creates a SigStore, and now I’m back in the same problem of needing to create Signature from SigStore!

From sourmash/src/storage.rs:

impl Storage for MemStorage {
    fn save(&self, _path: &str, _content: &[u8]) -> Result<String> {
        unimplemented!()
    }

		/// boo unimplemented!
    fn load(&self, _path: &str) -> Result<Vec<u8>> {
        unimplemented!()
    }

    fn args(&self) -> StorageArgs {
        unimplemented!()
    }

    fn load_sig(&self, path: &str) -> Result<SigStore> {
        Ok(self.sigs.read().unwrap().get(path).unwrap().clone())
    }

    fn save_sig(&self, path: &str, sig: Signature) -> Result<String> {
        // side-step saving to store
        let sig_store: SigStore = sig.into();
        self.sigs.write().unwrap().insert(path.into(), sig_store);
        Ok(path.into())
    }

    fn spec(&self) -> String {
        "memory://".into()
    }
}

SigStore.storage is a private field

In the original code for SigStore.data, it uses the SigStore.storage, not Collection.storage as above. However, when I tried using SigStorage, I learned that this is also a private field to SigStore:

error[E0616]: field `storage` of struct `SigStore` is private
   --> src/utils.rs:695:56
    |
695 |                 } else if let Some(storage) = sigstore.storage {
    |                                                        ^^^^^^^ private field

The Ask

  • Is there something simple I'm missing in how to get the Signature objects out of a Collection?
  • In sourmash, can access to SigStore.data be made public? Or otherwise a SigStore::to_signature method?

@olgabot
Copy link
Contributor Author

olgabot commented Jul 3, 2024

The Ask

  • Is there something simple I'm missing in how to get the Signature objects out of a Collection?
  • In sourmash, can access to SigStore.data be made public? Or otherwise a SigStore::to_signature method?

Yes, I am missing something simple... Signature::from can take a SigStore and make a Signature ... 🙈

image

Let me see if I can get this to work.

Previous errors:

============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-8.2.1, pluggy-1.5.0
rootdir: /Users/olgabot/code/sourmash_plugin_branchwater
configfile: pyproject.toml
collected 261 items

src/python/tests/test_cluster.py ..........FFFF..                        [  6%]
src/python/tests/test_gather.py .F.F...FF...FF.F...F.F.F.F.F....F.FF     [ 19%]
src/python/tests/test_index.py .F......FF.F....FF                        [ 26%]
src/python/tests/test_multigather.py .FFFFFFFF..FFFF.FFFFFFFFF.FF.FFFFFF [ 40%]
FFFF.FF......FF                                                          [ 45%]
src/python/tests/test_multisearch.py .FFF.FFF.FFF....FF.FFFF.FFFFFFFFF.. [ 59%]
..F                                                                      [ 60%]
src/python/tests/test_pairwise.py .F.F.F...F....F.F.F....FFF             [ 70%]
src/python/tests/test_search.py .FFF.FFFFFFFFFF.F.F.FFFFF.FFFF.FFFFFFFFF [ 85%]
FFFFFFFF......                                                           [ 91%]
src/python/tests/test_sketch.py ....................F.F                  [100%]

New errors (yay there are fewer!)

============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-8.2.1, pluggy-1.5.0
rootdir: /Users/olgabot/code/sourmash_plugin_branchwater
configfile: pyproject.toml
collected 261 items

src/python/tests/test_cluster.py ................                        [  6%]
src/python/tests/test_gather.py ............FF.F...................F     [ 19%]
src/python/tests/test_index.py ........FF........                        [ 26%]
src/python/tests/test_multigather.py ......FFF.....F..FF.F.F...F..F....F [ 40%]
F.....F......FF                                                          [ 45%]
src/python/tests/test_multisearch.py .................F....F............ [ 59%]
...                                                                      [ 60%]
src/python/tests/test_pairwise.py .........F................             [ 70%]
src/python/tests/test_search.py .......................FF....F.......... [ 85%]
..............                                                           [ 91%]
src/python/tests/test_sketch.py ....................F.F                  [100%]

@olgabot
Copy link
Contributor Author

olgabot commented Jul 5, 2024

Got rid of errors in test_index.py, but still more to do:

(branchwater-dev) (base) 
 ✘  ♥ 77%Fri  5 Jul - 12:00~/code/sourmash_plugin_branchwater   originolgabot/index-multiple-manifests1 ✔ 
 @olgabotpytest | head -n 20 
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-8.2.1, pluggy-1.5.0
rootdir: /Users/olgabot/code/sourmash_plugin_branchwater
configfile: pyproject.toml
collected 261 items

src/python/tests/test_cluster.py ................                        [  6%]
src/python/tests/test_gather.py ............FF.F...................F     [ 19%]
src/python/tests/test_index.py ..................                        [ 26%]
src/python/tests/test_multigather.py ......FFF.....F..FF.F.F...F..F....F [ 40%]
F.....F......FF                                                          [ 45%]
src/python/tests/test_multisearch.py .................F....F............ [ 59%]
...                                                                      [ 60%]
src/python/tests/test_pairwise.py .........F................             [ 70%]
src/python/tests/test_search.py .......................FF....F.......... [ 85%]
..............                                                           [ 91%]
src/python/tests/test_sketch.py ....................F.F                  [100%]

@ctb
Copy link
Collaborator

ctb commented Jul 13, 2024

hi @olgabot - not sure if this applies to this PR, but I wanted to alert you to some considerations that are coming up for manifests and relative path interpretation over in #390 (comment). Happy to chat more if you think it is relevant.

@ctb
Copy link
Collaborator

ctb commented Jul 13, 2024

(I think as long as you are working with fromfiles/pathlists you're fine, and the main recommendation for you currently is - get things working for absolute paths first, because relative path interpretation is a bit wibbly and might need some tender loving care in the future.)

@ctb
Copy link
Collaborator

ctb commented Aug 21, 2024

@olgabot I tackled this from another angle - standalone manifests - and I ended up writing a whole new struct, MultiCollection. Main PR here, #434, working to get it to pass all the tests here: #430. I think this will address most of what you set out to accomplish here, although index is proving to be a bit of a bear.

@olgabot
Copy link
Contributor Author

olgabot commented Sep 7, 2024

Thanks @ctb! Sorry I missed your message, it's been a bit crazy on my side.

If index isn't working yet, I'll try manysearch on 644 zipped files, each containing of 100k singleton signatures.

@olgabot
Copy link
Contributor Author

olgabot commented Sep 7, 2024

Hmm though it does seem like index may be working now from sourmash-bio/sourmash#3250 ... let me try that

@ctb
Copy link
Collaborator

ctb commented Sep 7, 2024

Hmm though it does seem like index may be working now from sourmash-bio/sourmash#3250 ... let me try that

(right, for internal storage it can work, but ... probably not yet ;)

@olgabot
Copy link
Contributor Author

olgabot commented Sep 7, 2024

Hmm though it does seem like index may be working now from sourmash-bio/sourmash#3250 ... let me try that

(right, for internal storage it can work, but ... probably not yet ;)

Internal storage is fine for me since all the files are on s3 anyway, so referencing other blobs isn't the best idea. However, I'm still not able to do sourmash index list_of_zips.txt :( From what I can tell, I think that's what #430 is implementing with MultiCollection, is that right?

Error message from `sourmash index list_of_zips.txt`
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/kmerseek] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_KMERSEEK:KMERSEEK:INDEX:SOURMASH_INDEX (uniprotkb_Synaptosomal_associated_2024_05_11.fasta__k5)'

Caused by:
  Process `NFCORE_KMERSEEK:KMERSEEK:INDEX:SOURMASH_INDEX (uniprotkb_Synaptosomal_associated_2024_05_11.fasta__k5)` terminated with an error exit status (1)


Command executed:

  # To avoid long argument lists with "too many arguments" error, we will create a manifest file
  # so create a CSV file with the signature file name
  touch uniprotkb_Synaptosomal_associated_2024_05_11.fasta.protein.k5__filelist.csv
  for sig in uniprotkb_Synaptosomal_associated_2024_05_11.part_002.fasta.protein.k5.sig.zip uniprotkb_Synaptosomal_associated_2024_05_11.part_001.fasta.protein.k5.sig.zip uniprotkb_Synaptosomal_associated_2024_05_11.part_003.fasta.protein.k5.sig.zip uniprotkb_Synaptosomal_associated_2024_05_11.part_005.fasta.protein.k5.sig.zip uniprotkb_Synaptosomal_associated_2024_05_11.part_006.fasta.protein.k5.sig.zip uniprotkb_Synaptosomal_associated_2024_05_11.part_004.fasta.protein.k5.sig.zip uniprotkb_Synaptosomal_associated_2024_05_11.part_008.fasta.protein.k5.sig.zip uniprotkb_Synaptosomal_associated_2024_05_11.part_007.fasta.protein.k5.sig.zip uniprotkb_Synaptosomal_associated_2024_05_11.part_009.fasta.protein.k5.sig.zip uniprotkb_Synaptosomal_associated_2024_05_11.part_010.fasta.protein.k5.sig.zip; do
      echo $sig >> uniprotkb_Synaptosomal_associated_2024_05_11.fasta.protein.k5__filelist.csv
  done


  # Branchwater version = "sourmash scripts" for now
  sourmash scripts index \
      --cores 2 \
      --ksize 5 \
      --moltype protein \
      --scaled 1 \
      --output 'uniprotkb_Synaptosomal_associated_2024_05_11.fasta.protein.k5.index.rocksdb' \
      uniprotkb_Synaptosomal_associated_2024_05_11.fasta.protein.k5__filelist.csv

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_KMERSEEK:KMERSEEK:INDEX:SOURMASH_INDEX":
      sourmash: $(echo $(sourmash --version 2>&1) | sed 's/^sourmash //' )
  END_VERSIONS

Command exit status:
  1

Command output:
  Loading siglist

Command error:


  == This is sourmash version 4.8.11. ==

  == Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==


  ksize: 5 / scaled: 1 / moltype: protein

  indexing all sketches in 'uniprotkb_Synaptosomal_associated_2024_05_11.fasta.protein.k5__filelist.csv'
  Reading analysis(s) from: 'uniprotkb_Synaptosomal_associated_2024_05_11.fasta.protein.k5__filelist.csv'
  Loading siglist
  Sketch loading error: expected value at line 1 column 1
  WARNING: could not load sketches from path 'uniprotkb_Synaptosomal_associated_2024_05_11.part_002.fasta.protein.k5.sig.zip'
  Sketch loading error: expected value at line 1 column 1
  WARNING: could not load sketches from path 'uniprotkb_Synaptosomal_associated_2024_05_11.part_004.fasta.protein.k5.sig.zip'
  Sketch loading error: expected value at line 1 column 1
  WARNING: could not load sketches from path 'uniprotkb_Synaptosomal_associated_2024_05_11.part_008.fasta.protein.k5.sig.zip'
  Sketch loading error: expected value at line 1 column 1
  WARNING: could not load sketches from path 'uniprotkb_Synaptosomal_associated_2024_05_11.part_001.fasta.protein.k5.sig.zip'
  Sketch loading error: expected value at line 1 column 1
  WARNING: could not load sketches from path 'uniprotkb_Synaptosomal_associated_2024_05_11.part_007.fasta.protein.k5.sig.zip'
  Sketch loading error: expected value at line 1 column 1
  WARNING: could not load sketches from path 'uniprotkb_Synaptosomal_associated_2024_05_11.part_003.fasta.protein.k5.sig.zip'
  Sketch loading error: expected value at line 1 column 1
  WARNING: could not load sketches from path 'uniprotkb_Synaptosomal_associated_2024_05_11.part_009.fasta.protein.k5.sig.zip'
  Sketch loading error: expected value at line 1 column 1
  WARNING: could not load sketches from path 'uniprotkb_Synaptosomal_associated_2024_05_11.part_005.fasta.protein.k5.sig.zip'
  Sketch loading error: expected value at line 1 column 1
  WARNING: could not load sketches from path 'uniprotkb_Synaptosomal_associated_2024_05_11.part_010.fasta.protein.k5.sig.zip'
  Sketch loading error: expected value at line 1 column 1
  WARNING: could not load sketches from path 'uniprotkb_Synaptosomal_associated_2024_05_11.part_006.fasta.protein.k5.sig.zip'
  No valid signatures found in analysis pathlist 'uniprotkb_Synaptosomal_associated_2024_05_11.fasta.protein.k5__filelist.csv'
  WARNING: 10 analysis paths failed to load. See error messages above.
  Error: Signatures failed to load. Exiting.

Work dir:
  /Users/olgabot/code/nf-core-kmerseek/work/2d/f4341b4680137ac48ba80d37e8a8ac

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

 -- Check '.nextflow.log' file for details

@ctb
Copy link
Collaborator

ctb commented Sep 7, 2024

Internal storage is fine for me since all the files are on s3 anyway, so referencing other blobs isn't the best idea. However, I'm still not able to do sourmash index list_of_zips.txt :( From what I can tell, I think that's what #430 is implementing with MultiCollection, is that right?

#430 will probably not implement this for sourmash scripts index, specifically, but will for everything else.

For now, RevIndex with non-internal storage requires a single Storage struct, which works out to a single zip file, sig.gz file, or other kind of supported backend.

Although, I didn't think about supporting it just for internal storage... that should actually work. Will see what I can do :).

@ctb
Copy link
Collaborator

ctb commented Sep 9, 2024

For now, RevIndex with non-internal storage requires a single Storage struct, which works out to a single zip file, sig.gz file, or other kind of supported backend.

Although, I didn't think about supporting it just for internal storage... that should actually work. Will see what I can do :).

Implemented in #451, which I'll be merging into #450 and #430. I'll update here when it's officially tested, but it should work now.

@ctb
Copy link
Collaborator

ctb commented Oct 16, 2024

#430 is merged, and all the functionality is released in sourmash_plugin_branchwater v0.9.8 🎉 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

index doesn't work with a text file list of manifests
2 participants