[MRG] update JOSS paper for v4 (#1361)
Per openjournals/joss#423, we are planning to
make a new submission to JOSS for sourmash v4.x. The first paper from
2016 is [here](https://joss.theoj.org/papers/10.21105/joss.00027).

Fixes #1321.

Ref #622,
#444.

- [x] update authors per #1367
- [x] fix `@@@` in paper.md YAML header
- [x] review [JOSS
guidelines](https://joss.readthedocs.io/en/latest/submitting.html) and
think about what to add, if anything.
- [x] add funding acks - Moore esp
- [ ] let's fix citation issue finally
#511

---------

Co-authored-by: Tessa Pierce Ward <[email protected]>
Co-authored-by: N. Tessa Pierce-Ward <[email protected]>
Co-authored-by: Katrin Leinweber <[email protected]>
4 people authored Aug 16, 2023
1 parent c35cc19 commit 6c28366
Showing 4 changed files with 231 additions and 17 deletions.
23 changes: 23 additions & 0 deletions .github/workflows/draft-pdf.yml
@@ -0,0 +1,23 @@
on: [push]

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          # This should be the path to the paper within your repo.
          paper-path: paper.md
      - name: Upload
        uses: actions/upload-artifact@v1
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: paper.pdf
61 changes: 60 additions & 1 deletion paper.bib
@@ -1,4 +1,4 @@
@article{ondov2015fast,
@article{Ondov:2015,
title={Fast genome and metagenome distance estimation using MinHash},
author={Ondov, Brian D and Treangen, Todd J and Mallonee, Adam B and Bergman, Nicholas H and Koren, Sergey and Phillippy, Adam M},
journal={bioRxiv},
@@ -8,3 +8,62 @@ @article{ondov2015fast
doi={10.1101/029827},
url={https://doi.org/10.1101/029827}
}

@article{Brown:2016,
doi = {10.21105/joss.00027},
url = {https://doi.org/10.21105/joss.00027},
year = {2016},
publisher = {The Open Journal},
volume = {1},
number = {5},
pages = {27},
author = {C. Titus Brown and Luiz Irber},
title = {sourmash: a library for MinHash sketching of DNA},
journal = {Journal of Open Source Software}
}

@article{Pierce:2019,
doi = {10.12688/f1000research.19675.1},
url = {https://doi.org/10.12688/f1000research.19675.1},
year = {2019},
month = jul,
publisher = {F1000 Research Ltd},
volume = {8},
pages = {1006},
author = {N. Tessa Pierce and Luiz Irber and Taylor Reiter and Phillip Brooks and C. Titus Brown},
title = {Large-scale sequence comparisons with sourmash},
journal = {F1000Research}
}
@article{gather,
title={Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers},
author={Irber, Luiz Carlos and Brooks, Phillip T and Reiter, Taylor E and Pierce-Ward, N Tessa and Hera, Mahmudur Rahman and Koslicki, David and Brown, C Titus},
journal={bioRxiv},
year={2022},
publisher={Cold Spring Harbor Laboratory}
}

@article{branchwater,
title={Sourmash Branchwater Enables Lightweight Petabyte-Scale Sequence Search},
author={Irber, Luiz Carlos and Pierce-Ward, N Tessa and Brown, C Titus},
journal={bioRxiv},
year={2022},
publisher={Cold Spring Harbor Laboratory}
}

@article{koslicki2019improving,
title={Improving minhash via the containment index with applications to metagenomic analysis},
author={Koslicki, David and Zabeti, Hooman},
journal={Applied Mathematics and Computation},
volume={354},
pages={206--215},
year={2019},
publisher={Elsevier}
}

@article{hera2022debiasing,
title={Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances},
author={Hera, Mahmudur Rahman and Pierce-Ward, N Tessa and Koslicki, David},
journal={bioRxiv},
year={2022},
publisher={Cold Spring Harbor Laboratory}
}
162 changes: 147 additions & 15 deletions paper.md
@@ -1,32 +1,164 @@
---
title: 'sourmash: a library for MinHash sketching of DNA'
title: 'sourmash: a tool to quickly search, compare, and analyze genomic and metagenomic data sets'
tags:
- FracMinHash
- MinHash
- k-mers
- Python
- Rust
authors:
- name: C. Titus Brown
orcid: 0000-0001-6001-2677
affiliation: University of California, Davis
- name: Luiz Irber
orcid: 0000-0003-4371-9659
affiliation: University of California, Davis
date: 13 Sep 2016
equal-contrib: true
affiliation: 1
- name: N. Tessa Pierce-Ward
orcid: 0000-0002-2942-5331
equal-contrib: true
affiliation: 1
- name: Mohamed Abuelanin
orcid: 0000-0002-3419-4785
affiliation: 1
- name: Harriet Alexander
orcid: 0000-0003-1308-8008
affiliation: 2
- name: Abhishek Anant
orcid: 0000-0002-5751-2010
affiliation: 9
- name: Keya Barve
orcid: 0000-0003-3241-2117
affiliation: 1
- name: Colton Baumler
orcid: 0000-0002-5926-7792
affiliation: 1
- name: Olga Botvinnik
orcid: 0000-0003-4412-7970
affiliation: 3
- name: Phillip Brooks
orcid: 0000-0003-3987-244X
affiliation: 1
- name: Daniel Dsouza
orcid: 0000-0001-7843-8596
affiliation: 9
- name: Laurent Gautier
orcid: 0000-0003-0638-3391
affiliation: 9
- name: Mahmudur Rahman Hera
orcid: 0000-0002-5992-9012
affiliation: 4
- name: Hannah Eve Houts
orcid: 0000-0002-7954-4793
affiliation: 1
- name: Lisa K. Johnson
orcid: 0000-0002-3600-7218
affiliation: 1
- name: Fabian Klötzl
orcid: 0000-0002-6930-0592
affiliation: 5
- name: David Koslicki
orcid: 0000-0002-0640-954X
affiliation: 4
- name: Marisa Lim
orcid: 0000-0003-2097-8818
affiliation: 1
- name: Ricky Lim
orcid: 0000-0003-1313-7076
affiliation: 9
- name: Ivan Ogasawara
orcid: 0000-0001-5049-4289
affiliation: 9
- name: Taylor Reiter
orcid: 0000-0002-7388-421X
affiliation: 1
- name: Camille Scott
orcid: 0000-0001-8822-8779
affiliation: 1
- name: Andreas Sjödin
orcid: 0000-0001-5350-4219
affiliation: 6
- name: Daniel Standage
orcid: 0000-0003-0342-8531
affiliation: 7
- name: S. Joshua Swamidass
orcid: 0000-0003-2191-0778
affiliation: 8
- name: Connor Tiffany
orcid: 0000-0001-8188-7720
affiliation: 9
- name: Pranathi Vemuri
orcid: 0000-0002-5748-9594
affiliation: 3
- name: Erik Young
orcid: 0000-0002-9195-9801
affiliation: 1
- name: C. Titus Brown
orcid: 0000-0001-6001-2677
corresponding: true
affiliation: 1
affiliations:
- name: University of California, Davis
index: 1
- name: Woods Hole Oceanographic Institution
index: 2
- name: Chan-Zuckerberg Biohub
index: 3
- name: Pennsylvania State University
index: 4
- name: MPI for Evolutionary Biology
index: 5
- name: Swedish Defence Research Agency (FOI)
index: 6
- name: National Bioforensic Analysis Center
index: 7
- name: Washington University in St Louis
index: 8
- name: No affiliation
index: 9

date: 27 Mar 2023
bibliography: paper.bib
---

# Summary

sourmash is a toolbox for creating, comparing, and manipulating MinHash
sketches of genomic data.
sourmash is a command line tool and Python library for sketching
collections of DNA, RNA, and amino acid k-mers for biological sequence
search, comparison, and analysis [@Pierce:2019]. sourmash's FracMinHash sketching supports fast and accurate sequence comparisons between datasets of different sizes [@gather], including petabase-scale database search [@branchwater]. From release 4.x, sourmash is built on top of Rust and provides an experimental Rust interface.

FracMinHash sketching is a lossy compression approach that represents
data sets using a "fractional" sketch containing $1/S$ of the original
k-mers. Like other sequence sketching techniques (e.g. MinHash, [@Ondov:2015]), FracMinHash provides a lightweight way to store representations of large DNA or RNA sequence collections for comparison and search. Sketches can be used to identify samples, find similar samples, identify data sets with shared sequences, and build phylogenetic trees. FracMinHash sketching supports estimation of overlap, bidirectional containment, and Jaccard similarity between data sets and is accurate even for data sets of very different sizes.

Since sourmash v1 was released in 2016 [@Brown:2016], sourmash has expanded
to support new database types and many more command line functions.
In particular, sourmash now has robust support for both Jaccard similarity
and containment calculations, which enables analysis and comparison of data sets
of different sizes, including large metagenomic samples. As of v4.4,
sourmash can convert these to estimated Average Nucleotide Identity (ANI)
values, which can provide improved biological context to sketch comparisons [@hera2022debiasing].

# Statement of Need

Large collections of genomes, transcriptomes, and raw sequencing data
sets are readily available in biology, and the field needs lightweight
computational methods for searching and summarizing the content of
both public and private collections. sourmash provides a flexible set
of programmatic functionality for this purpose, together with a robust
and well-tested command-line interface. It has been used in well over 200
publications (based on citations of @Brown:2016 and @Pierce:2019) and it continues
to expand in functionality.

# Acknowledgements

MinHash sketches provide a lightweight way to store "signatures" of
large DNA or RNA sequence collections, and then compare or search them
using a Jaccard index. MinHash sketches can be used to identify samples,
find similar samples, identify data sets with shared sequences, and
build phylogenetic trees [@ondov2015fast].
This work is funded in part by the Gordon and Betty Moore Foundation’s
Data-Driven Discovery Initiative [GBMF4551 to CTB].

sourmash provides a command line script, a Python library, and a CPython
module for MinHash sketches.
Notice: This manuscript has been authored by BNBI under Contract
No. HSHQDC-15-C-00064 with the DHS. The US Government retains
and the publisher, by accepting the article for publication, acknowledges
that the USG retains a non-exclusive, paid-up, irrevocable, world-wide
license to publish or reproduce the published form of this manuscript,
or allow others to do so, for USG purposes. Views and conclusions
contained herein are those of the authors and should not be interpreted
to represent policies, expressed or implied, of the DHS.

# References
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -13,6 +13,7 @@ version = "4.8.3-dev"

authors = [
{ name="Luiz Irber", orcid="0000-0003-4371-9659" },
{ name="N. Tessa Pierce-Ward", orcid="0000-0002-2942-5331" },
{ name="Mohamed Abuelanin", orcid="0000-0002-3419-4785" },
{ name="Harriet Alexander", orcid="0000-0003-1308-8008" },
{ name="Abhishek Anant", orcid="0000-0002-5751-2010" },
@@ -33,7 +34,6 @@ authors = [
{ name="Marisa Lim", orcid="0000-0003-2097-8818" },
{ name="Ricky Lim", orcid="0000-0003-1313-7076" },
{ name="Ivan Ogasawara", orcid="0000-0001-5049-4289" },
{ name="N. Tessa Pierce", orcid="0000-0002-2942-5331" },
{ name="Taylor Reiter", orcid="0000-0002-7388-421X" },
{ name="Camille Scott", orcid="0000-0001-8822-8779" },
{ name="Andreas Sjödin", orcid="0000-0001-5350-4219" },
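
Illustration of the FracMinHash description in the paper.md Summary (a sketch that retains roughly $1/S$ of the k-mers and supports Jaccard, containment, and ANI estimates). This is a minimal sketch under stated assumptions, not sourmash's implementation: the function names are hypothetical, `hashlib.blake2b` stands in for whatever hash function sourmash uses, k-mers are not canonicalized against their reverse complement, and the ANI conversion is the simple point estimate that follows from assuming independent point mutations (containment ≈ ANI^k), without the debiasing or confidence intervals discussed in the cited preprint.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space used in this illustration


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fracminhash(seq, k, scaled):
    """Keep only hashes below MAX_HASH / scaled.

    With a well-mixed hash this retains about 1/scaled of the
    distinct k-mers, independent of the size of the input.
    """
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Jaccard similarity of two sketches (sets of hashes)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of the hashes in sketch a that also occur in sketch b."""
    return len(a & b) / len(a) if a else 0.0


def ani_from_containment(c, k):
    """Simplified point estimate of ANI, assuming c is about ANI**k."""
    return c ** (1.0 / k)
```

For example, `containment(fracminhash(genome, 31, 1000), fracminhash(metagenome, 31, 1000))` would estimate how much of `genome` is present in `metagenome` (both hypothetical input strings). Because every sketch built with the same `scaled` value applies the same hash threshold, sketches of very differently sized inputs stay directly comparable, which is the property the Summary highlights.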