Alpha version of HDFS Toolkit, v2.1
This alpha release adds the ability to read text and sequence files in parallel via the HadoopReader operator. If you are looking for a production release, the latest one is v2.0.0.
Unlike the HDFS2FileSource, which reads either lines or binary blobs, the HadoopReader reads key-value pairs. (For text files, the key is the position in the file.)
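To make the key-value shape for text files concrete, here is a minimal Python sketch. It is not the operator's code. It assumes that "position" means the byte offset of the start of each line, which is how Hadoop's text input format keys its records; the function name `text_records` is hypothetical.

```python
import io


def text_records(data: bytes):
    """Yield (position, line) pairs for a text file held in memory.

    Assumption: position is the byte offset at which the line starts.
    """
    offset = 0
    for raw in io.BytesIO(data):
        # Strip the line terminator from the value, but count it in the offset.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)
```

Under that assumption, a file containing the two lines `hello` and `world` would produce the pairs `(0, "hello")` and `(6, "world")`.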
When in a parallel region, the HadoopReader reads a portion of the file as determined by its channel. Files compressed with unsplittable compression cannot be read in parallel; for those files, only channel 0 produces tuples. Sequence files, text files, and text files compressed with splittable compression (i.e., with bz2) are read in parallel.
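The channel-based division can be pictured with the following Python sketch. It is an illustration only: it assumes the file is cut into equal byte ranges, one per channel, whereas the operator most likely relies on Hadoop's own input splits, whose boundaries may differ. The function `channel_range` and its parameters are hypothetical names.

```python
def channel_range(file_size: int, channel: int, max_channels: int,
                  splittable: bool = True) -> tuple[int, int]:
    """Return the half-open byte range [start, end) a channel would read.

    Assumption: equal-sized byte ranges per channel (not confirmed by the toolkit).
    """
    if not splittable:
        # Unsplittable input: channel 0 reads everything, other channels read nothing.
        return (0, file_size) if channel == 0 else (0, 0)
    start = file_size * channel // max_channels
    end = file_size * (channel + 1) // max_channels
    return start, end
```

With four channels and a 100-byte splittable file, channel 0 would cover bytes 0 to 25 and channel 3 bytes 75 to 100; with an unsplittable file, channels 1 through 3 get an empty range.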
Some limitations of the operator are given here.
The demos/WordCount directory gives an example of using this operator to do word count.
Note that this is a pre-release. The operator interface (and even the name) may change, and there is no guarantee that the operator will be in the official HDFS toolkit v2.1.0, the next product version, or any future version. The code is in the SequenceFile branch, not the master branch.