# Hadoop SequenceFile library for Rust
## Usage

```toml
# Cargo.toml
[dependencies]
sequencefile = "0.2.0"
```
**Prototype status!**

Unfortunately, that means the API will change. If you depend on this crate, please pin an exact version for now.
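For example, Cargo's `=` requirement pins an exact release:

```toml
[dependencies]
sequencefile = "=0.2.0"
```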
## Status

Currently supports reading your garden-variety sequence file. Handles uncompressed sequence files as well as block/record-compressed files (deflate, gzip, and bzip2 only). LZO and Snappy are not (yet) handled.
There's a lot more to do:

- [x] Varint decoding (see the sketch after this list)
  - Block sizes are written with varints
- [x] Block decompression
- [x] Gzip support
- [x] Bzip2 support
- [x] Sequencefile metadata
- [x] Better error handling
- [x] Tests
- [ ] Better error handling2
- [ ] More tests
- [ ] Better documentation
- [ ] Snappy support
- [ ] CRC file support
- [ ] 'Writables', e.g. generic deserialization for common Hadoop writable types
- [ ] Writer
- [ ] Gracefully handle version 4 sequencefiles
- [ ] Zero-copy implementation
- [ ] LZO support
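Hadoop stores block sizes and record lengths with its own variable-length integer encoding (the on-disk format of Hadoop's `WritableUtils.readVLong`). As background for the varint item above, here is a minimal, self-contained decoding sketch of that format; `read_vlong` is a hypothetical helper, not this crate's API:

```rust
use std::io::{self, Read};

/// Decodes a Hadoop-style variable-length i64 (the on-disk format of
/// Hadoop's `WritableUtils.readVLong`).
fn read_vlong<R: Read>(r: &mut R) -> io::Result<i64> {
    let mut byte = [0u8; 1];
    r.read_exact(&mut byte)?;
    let first = byte[0] as i8;

    // Values in [-112, 127] are stored directly in a single byte.
    if first >= -112 {
        return Ok(i64::from(first));
    }

    // Otherwise the first byte encodes the sign and the number of
    // big-endian payload bytes that follow.
    let negative = first < -120;
    let len = if negative { -120 - first } else { -112 - first };

    let mut value: i64 = 0;
    for _ in 0..len {
        r.read_exact(&mut byte)?;
        value = (value << 8) | i64::from(byte[0]);
    }
    // Negative values are stored as the bitwise complement.
    Ok(if negative { !value } else { value })
}

fn main() {
    // 300 encodes as a length byte (-114: two payload bytes follow) plus 0x01, 0x2C.
    let encoded = [(-114i8) as u8, 0x01, 0x2C];
    assert_eq!(read_vlong(&mut &encoded[..]).unwrap(), 300);
}
```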
## Benchmarks

There are only two benchmarks so far. Both read sequence files (1,000 entries each) generated in Java with no compression, and both use Text as the key class. The first uses i64 as the value class; the second uses a more complex structure. Earlier investigations (with deflate on an early-2012 MBP) showed 98.4% of CPU time was spent in miniz, producing ~125MB/s of decompressed data.
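Assuming the benchmarks are wired up as Cargo bench targets, they can be run with Cargo's built-in runner:

```sh
cargo bench
```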
## Example

```rust
use std::fs::File;

use sequencefile::{Text, Writable};

// A custom value type; implementing `Writable` lets the reader decode it.
struct ValueClass {
    // some fields
}

impl Writable for ValueClass {
    fn read(buf: &mut impl std::io::Read) -> sequencefile::Result<Self>
    where
        Self: Sized,
    {
        // implement read function
        todo!()
    }
}

fn main() {
    let file = File::open("/path/to/seqfile").expect("cannot open file");
    let seqfile = sequencefile::Reader::<File, Text, ValueClass>::new(file)
        .expect("cannot open reader");

    // `flatten()` yields only successfully decoded (key, value) pairs.
    for kv in seqfile.flatten() {
        println!("{:?} - {:?}", kv.0, kv.1);
    }
}
```
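Since the reader is an iterator over `Result`s, `flatten()` silently skips records that fail to decode. A sketch that surfaces those errors instead (assuming the crate's error type implements `Display`):

```rust
for record in seqfile {
    match record {
        Ok((key, value)) => println!("{:?} - {:?}", key, value),
        Err(e) => eprintln!("failed to decode record: {}", e),
    }
}
```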
## License

sequencefile-rs is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0), with portions covered by various BSD-like licenses.

See LICENSE-APACHE and LICENSE-MIT for details.