Investigate keeping processes around #5

gtoonstra · 2015-07-04T12:10:32Z

The map/reduce examples have clear boundaries between startup, reading data, processing data and writing the results out to disk. The process lifetime doesn't extend beyond those boundaries, so every job pays the cost of disk access all over again.

Similar to Apache Spark, avoiding disk access can improve performance. It is important to realize that the processing boundaries are the same whether the data sits on disk or in memory. The only difference is that at the moment where the mapper (for example) would write a partition to disk and exit, it would instead stay around and wait for queries to be executed against the data in its partitions.
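To make the idea concrete, here is a minimal sketch of a worker that stays alive after the map step and answers queries against its in-memory partition. Everything in it is an assumption for illustration: the `ResidentWorker` name, the use of a thread and queues instead of a separate OS process, and the convention that a query is a plain Python function applied to the partition.

```python
import queue
import threading


class ResidentWorker:
    """Hypothetical worker that keeps its partition in memory after the map step."""

    def __init__(self, records):
        # Instead of writing the partition to disk and exiting, hold it in memory.
        self.partition = list(records)
        self.requests = queue.Queue()
        self.thread = threading.Thread(target=self._serve, daemon=True)
        self.thread.start()

    def _serve(self):
        # Block until a query arrives; a None request means "shut down".
        while True:
            item = self.requests.get()
            if item is None:
                break
            func, reply = item
            try:
                reply.put((func(self.partition), None))
            except Exception as exc:
                # Hand the error back to the caller instead of killing the worker.
                reply.put((None, exc))

    def query(self, func):
        # Send a function to run against the in-memory partition and wait for the result.
        reply = queue.Queue(maxsize=1)
        self.requests.put((func, reply))
        result, error = reply.get()
        if error is not None:
            raise error
        return result

    def stop(self):
        self.requests.put(None)
        self.thread.join()
```

Usage would look like `worker = ResidentWorker(rows)` followed by any number of `worker.query(...)` calls, with the data read only once. A real version would presumably use separate processes and some form of IPC; the thread here only keeps the sketch self-contained.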

What's left is to figure out how to express the functions to be executed against the data (which may be in any format) in a consistent way. Most of them are aggregation functions:

sum
group?
etc
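One possible way to express such functions consistently is to describe every aggregation as a triple: a starting value, a step that folds one record into the running value, and a merge that combines the results of two partitions. The sketch below assumes that shape; the names `Aggregation`, `sum_of`, `group_sum` and `run` are hypothetical, and records are assumed to be dicts purely for the example (the step function is where a different data format would be handled).

```python
from functools import reduce


class Aggregation:
    """Hypothetical description of an aggregation: zero value, per-record step, merge."""

    def __init__(self, zero, step, merge):
        self.zero = zero      # callable returning a fresh accumulator
        self.step = step      # (accumulator, record) -> accumulator
        self.merge = merge    # (accumulator, accumulator) -> accumulator


def sum_of(field):
    # Sum one field across all records.
    return Aggregation(
        zero=lambda: 0,
        step=lambda acc, rec: acc + rec[field],
        merge=lambda a, b: a + b,
    )


def group_sum(key, field):
    # Sum one field per distinct value of another field.
    def step(acc, rec):
        acc[rec[key]] = acc.get(rec[key], 0) + rec[field]
        return acc

    def merge(a, b):
        out = dict(a)
        for k, v in b.items():
            out[k] = out.get(k, 0) + v
        return out

    return Aggregation(zero=dict, step=step, merge=merge)


def run(agg, partitions):
    # Each partition produces a partial result; the partials are then merged.
    partials = []
    for part in partitions:
        acc = agg.zero()
        for rec in part:
            acc = agg.step(acc, rec)
        partials.append(acc)
    return reduce(agg.merge, partials, agg.zero())
```

With this shape, each long-lived process only needs to know how to run `zero` and `step` over its own partition, and whoever collects the answers only needs `merge`.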

Joins are a lot harder to achieve. Maybe the mapper/reducer process itself can implement specific functions that dictate how this is done, so that the framework doesn't become overly generic and hard to read.
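As a rough illustration of letting the process itself dictate how a join is done, the sketch below gives the worker a `join` hook that a specific mapper overrides. The class names, the field names (`customer_id`, `id`) and the choice of a simple in-memory hash join are all assumptions made for the example, not something the framework defines.

```python
class PartitionWorker:
    """Hypothetical base class for a worker holding one partition in memory."""

    def __init__(self, records):
        self.records = list(records)

    def join(self, other_records):
        # The framework stays generic: workers opt in by overriding this hook.
        raise NotImplementedError("this worker does not support joins")


class OrdersWorker(PartitionWorker):
    """Hypothetical worker that knows how to join its orders with customers."""

    def join(self, other_records):
        # Build a hash index over this worker's own records...
        index = {}
        for order in self.records:
            index.setdefault(order["customer_id"], []).append(order)
        # ...then probe it with the records from the other side.
        joined = []
        for customer in other_records:
            for order in index.get(customer["id"], []):
                joined.append({**customer, **order})
        return joined
```

The framework would only need to route the other side's records to the right worker; the join logic itself stays inside the mapper/reducer that understands the data.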
