MRG: update documentation for RocksDB index and internal/external storage, + miscellaneous improvements #416

Merged: ctb merged 8 commits into main from update_docs on Aug 12, 2024

@ctb ctb commented Aug 12, 2024

Tackling #409 and documenting RocksDB index internal/external storage per #390.

Items from #409:

This PR also updates the docs to:

Fixes #409.


Here are some of the bigger/more confusing changes, for reviewers to evaluate and help improve ;) --

multisearch threshold discussion:

The -t/--threshold for multisearch and pairwise applies to the
containment of query-in-target and defaults to 0.01. To report
any overlap between two sketches, set the threshold to 0.
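To make the threshold semantics concrete, here is a minimal Python sketch of query-in-target containment and the threshold check. It is only an illustration, not the plugin's implementation: the function names are hypothetical, sketches are modeled as plain sets of integer hashes, and the strict `>` comparison is an assumption chosen so that a threshold of 0 means "any overlap".

```python
def containment(query_hashes, target_hashes):
    """Fraction of the query's hashes that are also present in the target."""
    query = set(query_hashes)
    if not query:
        return 0.0
    return len(query & set(target_hashes)) / len(query)


def passes_threshold(query_hashes, target_hashes, threshold=0.01):
    """Decide whether a query/target pair would be reported.

    Assumption: the comparison is strict (>), so threshold=0 reports
    every pair that shares at least one hash and nothing else.
    """
    return containment(query_hashes, target_hashes) > threshold
```

With these definitions, a query of 200 hashes that shares a single hash with the target has containment 0.005: it is dropped at the default threshold of 0.01 and kept at a threshold of 0.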

manysearch output discussion

The results file here, query.x.gtdb-reps.csv, will have the following columns: query, query_md5, match_name, match_md5, containment, jaccard, max_containment, intersect_hashes, query_containment_ani.

If you run manysearch without using a RocksDB database (that is, against regular sketches), the results file will also have the following columns: match_containment_ani, average_containment_ani, and max_containment_ani.

Finally, if using sketches that have abundance information, the
results file will also contain the following columns: average_abund,
median_abund, std_abund, n_weighted_found, and total_weighted_hashes.

See
[the prefetch CSV output column documentation](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html#appendix-e-prefetch-csv-output-columns)
for information on these various columns.
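Since the set of columns depends on how manysearch was run, a small check of the CSV header can tell you which optional groups a given results file contains. The following is a hedged sketch using only the Python standard library; the helper names are hypothetical, and the column lists are copied from the description above rather than from the plugin's code.

```python
import csv

# Column groups as listed in the text above.
BASE_COLUMNS = [
    "query", "query_md5", "match_name", "match_md5", "containment",
    "jaccard", "max_containment", "intersect_hashes", "query_containment_ani",
]
ANI_COLUMNS = [
    "match_containment_ani", "average_containment_ani", "max_containment_ani",
]
ABUND_COLUMNS = [
    "average_abund", "median_abund", "std_abund",
    "n_weighted_found", "total_weighted_hashes",
]


def describe_columns(fieldnames):
    """Report which column groups appear in a results header."""
    present = set(fieldnames or [])
    return {
        "missing_base": [c for c in BASE_COLUMNS if c not in present],
        "has_ani_extras": all(c in present for c in ANI_COLUMNS),
        "has_abundance": all(c in present for c in ABUND_COLUMNS),
    }


def inspect_results(csv_file):
    """Read only the header of an open CSV file and describe it."""
    reader = csv.DictReader(csv_file)
    return describe_columns(reader.fieldnames)
```

For example, `inspect_results(open("query.x.gtdb-reps.csv", newline=""))` would report `has_ani_extras` as false for a search run against a RocksDB index, if the output follows the description above.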

Internal vs external storage of sketches in a RocksDB index

(The below applies to v0.9.7 and later of the plugin; for v0.9.6 and
before, only external storage was implemented.)

RocksDB indexes support containment queries (a la the
branchwater application),
as well as gather-style mixture decomposition (see
Irber et al., 2022).
For this plugin, the manysearch command supports a RocksDB index for
the database for containment queries, and fastmultigather can use a
RocksDB index for the database of genomes.

RocksDB indexes contain references to the sketches used to construct
the index. If --internal-storage is set (which is the default), a
copy of the sketches is stored within the RocksDB database directory;
if --no-internal-storage is provided, then the references point to
the original source sketches used to construct the database, wherever
they reside on your disk.

The sketches are not used by manysearch, but are used by
fastmultigather: with v0.9.6 and later, you'll get an error if you
run fastmultigather against a RocksDB index where the sketches
cannot be loaded.

In practice, this boils down to the following two approaches:

  1. The safest thing to do is build a RocksDB index and use internal
    storage (the default). This will consume more disk space but your
    RocksDB database will always be usable for both manysearch and
    fastmultigather, as well as the branchwater app.
  2. If you want to avoid storing duplicate copies of your sketches,
    then specify --no-internal-storage and provide a stable absolute
    path to the source sketches. This will again support both
    manysearch and fastmultigather, as well as the branchwater app.
    If the source sketches later become unavailable, fastmultigather
    will stop working (although manysearch and the branchwater app
    should be fine).
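
For the second approach, it can help to confirm that the source sketches are still reachable before running fastmultigather. The sketch below is a hypothetical helper, not part of the plugin: it assumes you keep your own list of the sketch paths that were used to build the index, and it does not read anything from the RocksDB index itself.

```python
from pathlib import Path


def missing_sketches(sketch_paths):
    """Return the paths that are not absolute or no longer exist as files.

    Assumption: sketch_paths is a list you maintain yourself, e.g. the
    paths originally passed in when the index was built.
    """
    problems = []
    for path in map(Path, sketch_paths):
        # Relative paths are flagged because they break when the working
        # directory changes; missing files are flagged because gather-style
        # queries need to load the sketches.
        if not path.is_absolute() or not path.is_file():
            problems.append(str(path))
    return problems
```

An empty return value means every listed sketch is at a stable absolute path and still present; anything else points at files to restore, or suggests rebuilding the index with internal storage.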

@ctb ctb changed the title WIP: update documentation for RocksDB index and internal/external storage, + miscellaneous improvements MRG: update documentation for RocksDB index and internal/external storage, + miscellaneous improvements Aug 12, 2024

ctb commented Aug 12, 2024

@luizirber @bluegenes your reviews (you can skim the PR description ;)) would be much appreciated. Don't merge yet, we need to get #408 and #390 in first, and then I plan to cut a new release v0.9.7 as soon as this is merged.

bluegenes commented

did you change fastmultigather --> multifastgather elsewhere, or is that just a typo here? :)

ctb commented Aug 12, 2024

> did you change fastmultigather --> multifastgather elsewhere, or is that just a typo here? :)

typo!

ctb commented Aug 12, 2024

(fixed)

@bluegenes bluegenes left a comment

lgtm

@ctb ctb merged commit 42b2aae into main Aug 12, 2024
1 check passed
@ctb ctb deleted the update_docs branch August 12, 2024 22:56
@ctb ctb mentioned this pull request Aug 13, 2024
Successfully merging this pull request may close these issues.

documentation updates for next release(s)