Prototype streaming Hadoop SequenceFile reader for Rust

Xorlev/sequencefile-rs

sequencefile-rs

Hadoop SequenceFile library for Rust

Documentation

# Cargo.toml
[dependencies]
sequencefile = "0.2.0"

Status

Prototype status!

Unfortunately, that means the API will change. If you depend on this crate, please pin an exact version for now (for example, sequencefile = "=0.2.0").

Currently supports reading garden-variety sequence files: uncompressed files as well as block- and record-compressed files (deflate, gzip, and bzip2 only). LZO and Snappy are not (yet) handled.
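A SequenceFile header names its compression codec by Java class name, so codec support comes down to recognizing that string. The sketch below is a minimal illustration, not this crate's API: the `Codec` enum and `codec_for_class` function are hypothetical names. The class names are the standard Hadoop ones.

```rust
/// Codecs this sketch treats as supported (hypothetical type, not the crate's).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Codec {
    Deflate,
    Gzip,
    Bzip2,
}

/// Map a Hadoop codec class name to a codec, or report it as unsupported.
fn codec_for_class(name: &str) -> Result<Codec, String> {
    match name {
        "org.apache.hadoop.io.compress.DefaultCodec" => Ok(Codec::Deflate),
        "org.apache.hadoop.io.compress.GzipCodec" => Ok(Codec::Gzip),
        "org.apache.hadoop.io.compress.BZip2Codec" => Ok(Codec::Bzip2),
        // Snappy, LZO, and anything else fall through to an error.
        other => Err(format!("unsupported codec: {}", other)),
    }
}
```

Returning an error for unknown codecs, rather than panicking, lets a caller skip or report files it cannot read.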

There's a lot more to do:

  • Varint decoding
  • Block sizes are written with Varints
  • Block decompression
  • Gzip support
  • Bzip2 support
  • Sequencefile metadata
  • Better error handling
  • Tests
  • Better error handling, round 2
  • More tests
  • Better documentation
  • Snappy support
  • CRC file support
  • 'Writables', e.g. generic deserialization for common Hadoop writable types
  • Writer
  • Gracefully handle version 4 sequencefiles
  • Zero-copy implementation.
  • LZO support.

Benchmarks

There are only two benchmarks so far. Both read sequence files (1000 entries each) generated in Java with no compression, and both use Text as the key class. The first has i64 as the value class; the second has a more complex structure. Earlier investigations (with deflate on an early-2012 MBP) showed that 98.4% of CPU time was spent in miniz, producing ~125 MB/s of decompressed data.

Usage

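use std::fs::File;

use sequencefile::Writable;
// `Text` must also be in scope; its module path is not shown here.

// `Debug` is required by the `{:?}` formatting below.
#[derive(Debug)]
struct ValueClass {
    // some fields
}

impl Writable for ValueClass {
    fn read(buf: &mut impl std::io::Read) -> sequencefile::Result<Self>
    where
        Self: Sized,
    {
        // implement read function
        todo!()
    }
}

fn main() {
    let file = File::open("/path/to/seqfile").expect("cannot open file");

    let seqfile = sequencefile::Reader::<File, Text, ValueClass>::new(file)
        .expect("cannot open reader");

    // `flatten()` yields only the entries that were read successfully.
    for kv in seqfile.flatten() {
        println!("{:?} - {:?}", kv.0, kv.1);
    }
}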
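As an illustration of what a `read` implementation might look like, the sketch below decodes an 8-byte big-endian integer, which is how Hadoop's `LongWritable` is serialized. To stay self-contained it defines a local stand-in `Writable` trait that mirrors only the `read` signature shown above, with `std::io::Result` in place of `sequencefile::Result`. The stand-in trait and the `LongValue` type are assumptions for the example, not the crate's definitions.

```rust
use std::io::{self, Read};

/// Local stand-in for the crate's `Writable` trait (assumption: only the
/// `read` signature from the usage example is mirrored here).
trait Writable {
    fn read(buf: &mut impl Read) -> io::Result<Self>
    where
        Self: Sized;
}

/// Hypothetical value type wrapping a single 64-bit integer.
#[derive(Debug, PartialEq)]
struct LongValue(i64);

impl Writable for LongValue {
    fn read(buf: &mut impl Read) -> io::Result<Self> {
        // Read exactly eight bytes and interpret them as big-endian.
        let mut bytes = [0u8; 8];
        buf.read_exact(&mut bytes)?;
        Ok(LongValue(i64::from_be_bytes(bytes)))
    }
}
```

With the real crate, the same body would go inside `impl sequencefile::Writable for ...`, returning `sequencefile::Result<Self>`. Whether I/O errors convert automatically with `?` depends on the crate's error type.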

License

sequencefile-rs is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0), with portions covered by various BSD-like licenses.

See LICENSE-APACHE and LICENSE-MIT for details.
