sequencefile-rs

Hadoop SequenceFile library for Rust

Documentation

# Cargo.toml
[dependencies]
sequencefile = "0.2.0"

Status

Prototype status!

Unfortunately, that means the API will change. If you depend on this crate, please specify the full version for now (for example "0.2.0", or "=0.2.0" for an exact pin).

Currently supports reading garden-variety sequence files: uncompressed files as well as block- and record-compressed files (deflate, gzip, and bzip2 only). LZO and Snappy are not (yet) handled.

There's a lot more to do. The list below tracks both completed and outstanding items:

  • Varint decoding
  • Block sizes are written with Varints
  • Block decompression
  • Gzip support
  • Bzip2 support
  • Sequencefile metadata
  • Better error handling
  • Tests
  • Better error handling, round 2
  • More tests
  • Better documentation
  • Snappy support
  • CRC file support
  • 'Writables', e.g. generic deserialization for common Hadoop writable types
  • Writer
  • Gracefully handle version 4 sequencefiles
  • Zero-copy implementation
  • LZO support
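
The varint items in the list above refer to Hadoop's variable-length integer encoding. As a rough illustration, here is a minimal sketch of decoding that format, following the rules of Hadoop's `WritableUtils.readVLong`. The function names `decode_vint_size` and `read_vlong` are hypothetical and are not taken from this crate's API.

```rust
use std::io::{self, Read};

/// Total number of bytes (first byte included) of a Hadoop-style vint,
/// derived from the first byte interpreted as a signed value.
/// Hypothetical helper, not part of the sequencefile crate.
fn decode_vint_size(first: i8) -> usize {
    if first >= -112 {
        1 // the value is stored directly in the first byte
    } else if first < -120 {
        (-119 - first as i32) as usize // negative multi-byte value
    } else {
        (-111 - first as i32) as usize // positive multi-byte value
    }
}

/// Reads one Hadoop-style variable-length long from `reader`.
/// Hypothetical helper, not part of the sequencefile crate.
fn read_vlong<R: Read>(reader: &mut R) -> io::Result<i64> {
    let mut byte = [0u8; 1];
    reader.read_exact(&mut byte)?;
    let first = byte[0] as i8;
    let len = decode_vint_size(first);
    if len == 1 {
        return Ok(first as i64);
    }
    // The remaining bytes hold the magnitude, most significant byte first.
    let mut value: i64 = 0;
    for _ in 0..len - 1 {
        reader.read_exact(&mut byte)?;
        value = (value << 8) | byte[0] as i64;
    }
    // Negative values are stored as their bitwise complement.
    Ok(if first < -120 { !value } else { value })
}
```

For example, the bytes 0x8E 0x01 0x2C decode to 300, while a single byte 0x05 decodes to 5. Values between -112 and 127 always fit in one byte, which keeps small block and record sizes compact.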

Benchmarks

There are only two benchmarks so far. Both read sequence files of 1000 entries each, generated in Java with no compression, and both use Text as the key class. The first uses i64 as the value class; the second uses a more complex structure. Earlier investigations (with deflate, on an early-2012 MBP) showed that 98.4% of CPU time was spent in miniz, producing ~125MB/s of decompressed data.

Usage

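use std::fs::File;

use sequencefile::Writable;
// `Text` (the key class) must also be in scope.

#[derive(Debug)] // required by the `{:?}` formatting below
struct ValueClass {
    // some fields
}

impl Writable for ValueClass {
    fn read(buf: &mut impl std::io::Read) -> sequencefile::Result<Self>
    where
        Self: Sized,
    {
        // implement read function
        todo!()
    }
}

fn main() {
    let file = File::open("/path/to/seqfile").expect("cannot open file");
    let seqfile = sequencefile::Reader::<File, Text, ValueClass>::new(file)
        .expect("cannot open reader");

    // `flatten()` drops any entry that is an error and yields only the successful ones.
    for kv in seqfile.flatten() {
        println!("{:?} - {:?}", kv.0, kv.1);
    }
}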
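
The body of `read` is left open above. As a minimal sketch of what such a body might contain, assume the value was written by Hadoop's `LongWritable`, which serializes an i64 as eight big-endian bytes. The `LongValue` type and its `read_from` method are hypothetical names, and the example uses only the standard library so that it stands alone.

```rust
use std::io::{self, Read};

/// Hypothetical value type holding one decoded `LongWritable`.
#[derive(Debug, PartialEq)]
struct LongValue(i64);

impl LongValue {
    /// Reads eight big-endian bytes, the layout produced by Java's
    /// `DataOutput.writeLong`, and converts them to an i64.
    fn read_from(buf: &mut impl Read) -> io::Result<Self> {
        let mut raw = [0u8; 8];
        // `read_exact` returns an error if fewer than eight bytes remain.
        buf.read_exact(&mut raw)?;
        Ok(LongValue(i64::from_be_bytes(raw)))
    }
}
```

The same two steps could serve as the body of a `Writable::read` implementation, provided the `std::io::Error` is converted into the crate's own error type. How that conversion works is not shown here and would need to be checked against the crate documentation.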

License

sequencefile-rs is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0), with portions covered by various BSD-like licenses.

See LICENSE-APACHE and LICENSE-MIT for details.