Replies: 2 comments
-
The primary motivation of noodles is to provide compliant implementations of well-known large-scale genomics specifications, particularly hts-specs. This is quite different from existing implementations, which tend to read and write particular dialects of the formats, either for historical reasons or out of ignorance of the specifications. Such dialects hinder interoperability and reduce confidence when sharing data or collaborating among users of different tools.

When strictly dealing with I/O, yes, htslib and noodles are rather similar. Both can read and write records in the supported alignment (SAM, BAM, and CRAM) and variant (VCF and BCF) formats, and both allow random access using an index.

The decoders/parsers, however, follow different philosophies. htslib is very lenient, whereas noodles is far stricter. For example, take the following alignment record (
I'd argue that this is not a valid record. The read name is non-ASCII, the positions and quality scores are out of range, the CIGAR operation is invalid, and the auxiliary data tags are both invalid and duplicated. Yet, htslib/samtools will read (and rewrite) this without error.
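To make these kinds of checks concrete, here is a minimal standalone sketch in Rust using only the standard library. It is not how noodles or htslib is implemented; `check_sam_line` and `is_valid_cigar` are hypothetical names introduced for illustration, and the sketch covers only a handful of the field rules from the SAM specification (read name characters, POS and MAPQ ranges, CIGAR operations, quality characters, and tag uniqueness).

```rust
use std::collections::HashSet;

/// Checks a few fields of one tab-separated SAM alignment line against the
/// patterns and ranges given in the SAM specification.
/// Returns human-readable problems; an empty list means these checks passed.
/// (Hypothetical helper, not part of noodles or htslib.)
fn check_sam_line(line: &str) -> Vec<String> {
    let mut problems: Vec<String> = Vec::new();
    let fields: Vec<&str> = line.split('\t').collect();

    if fields.len() < 11 {
        problems.push(format!("expected at least 11 fields, got {}", fields.len()));
        return problems;
    }

    // QNAME: [!-?A-~]{1,254}, i.e. printable ASCII except '@'.
    let qname = fields[0];
    let qname_ok = (1..=254).contains(&qname.len())
        && qname.bytes().all(|b| matches!(b, b'!'..=b'?' | b'A'..=b'~'));
    if !qname_ok {
        problems.push("invalid read name".to_string());
    }

    // POS: 0 to 2^31 - 1.
    match fields[3].parse::<u32>() {
        Ok(n) if n <= i32::MAX as u32 => {}
        _ => problems.push("position out of range".to_string()),
    }

    // MAPQ: 0 to 255.
    if fields[4].parse::<u8>().is_err() {
        problems.push("mapping quality out of range".to_string());
    }

    // CIGAR: "*" or one or more <length><op> pairs.
    if !is_valid_cigar(fields[5]) {
        problems.push("invalid CIGAR".to_string());
    }

    // QUAL: one or more characters in [!-~].
    let qual = fields[10];
    if qual.is_empty() || !qual.bytes().all(|b| (b'!'..=b'~').contains(&b)) {
        problems.push("invalid quality scores".to_string());
    }

    // Optional fields: TAG:TYPE:VALUE, where TAG is [A-Za-z][A-Za-z0-9]
    // and each tag may appear only once per line.
    let mut seen = HashSet::new();
    for field in &fields[11..] {
        let tag = field.split(':').next().unwrap_or("");
        let bytes = tag.as_bytes();
        let tag_ok = bytes.len() == 2
            && bytes[0].is_ascii_alphabetic()
            && bytes[1].is_ascii_alphanumeric();
        if !tag_ok {
            problems.push(format!("invalid tag: {}", tag));
        } else if !seen.insert(tag) {
            problems.push(format!("duplicate tag: {}", tag));
        }
    }

    problems
}

/// Returns true if `s` is "*" or a sequence of <digits><op> pairs,
/// where op is one of MIDNSHP=X.
fn is_valid_cigar(s: &str) -> bool {
    if s == "*" {
        return true;
    }
    let mut digits = 0;
    let mut ops = 0;
    for b in s.bytes() {
        match b {
            b'0'..=b'9' => digits += 1,
            b'M' | b'I' | b'D' | b'N' | b'S' | b'H' | b'P' | b'X' | b'=' if digits > 0 => {
                digits = 0;
                ops += 1;
            }
            _ => return false,
        }
    }
    digits == 0 && ops > 0
}
```

A strict reader in this style would reject a line as soon as any problem is reported, while a lenient one would pass the raw fields through unchanged.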
The typical record reader in noodles (i.e.,
But perhaps this is fine; perhaps users want nonsensical and/or bad data. htslib made a choice to prefer performance over conformance. (noodles also provides an alternative lazy/raw record reader, which allows per-field decoding but does not allow rewriting the fields.) This specification stringency is likely the major practical difference users will encounter in noodles compared to htslib. Some extra notes:
-
Great, thanks for the detailed reply. This makes sense and I think people do care a lot about valid data. I will evaluate our needs and decide which one works best for our project. Thanks so much!
-
Hi,
I was looking at using Rust for bioinformatics processing and was wondering what the main difference between noodles and rust-htslib is from a practical user perspective. Is noodles supposed to be significantly faster, or is the point simply that it is pure Rust? htslib is widely used and well tested, so it seems like rust-htslib can do almost everything noodles can do, no? I am aware that rust-htslib doesn't have BED functionality, though they do have rust-bio, which does handle BED files. Any comments would be appreciated, especially from the authors. Thanks!