Investigate keeping processes around #5

Open
gtoonstra opened this issue Jul 4, 2015 · 0 comments


The map/reduce examples have clear boundaries between startup, reading data, processing data and writing it out to disk. The process lifetime doesn't extend beyond those boundaries, so every job pays the cost of disk I/O again.

As in Apache Spark, keeping data in memory avoids repeated disk access, which can improve performance. It is important to realize that the boundaries of the processing are the same whether the data lives on disk or in memory. The only difference is that at the point where the mapper (for example) currently writes a partition to disk and exits, it would simply stay around and wait for queries to be executed against the data in its partitions.
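To make the idea concrete, here is a minimal sketch of a mapper that stays alive after partitioning. Everything in it is an assumption for illustration: the `ResidentMapper` class, the `run_query` helper, the use of an in-process `queue.Queue` as the request channel and the `None` shutdown sentinel are all hypothetical, not part of the existing framework.

```python
import queue
import threading


class ResidentMapper:
    """Hypothetical mapper that keeps its partitions in memory."""

    def __init__(self, num_partitions):
        self.partitions = {i: [] for i in range(num_partitions)}

    def map(self, records):
        # Same boundary as today: records are split into partitions.
        for key, value in records:
            index = hash(key) % len(self.partitions)
            self.partitions[index].append((key, value))

    def serve(self, requests):
        # Instead of exiting, wait for queries until a None sentinel arrives.
        while True:
            item = requests.get()
            if item is None:
                break
            query, reply = item
            # Run the query against every in-memory partition.
            reply.put([query(part) for part in self.partitions.values()])


def run_query(requests, query):
    """Send one query to the resident mapper and wait for its answer."""
    reply = queue.Queue()
    requests.put((query, reply))
    return reply.get()


mapper = ResidentMapper(num_partitions=2)
mapper.map([("a", 1), ("b", 2), ("a", 3)])

requests = queue.Queue()
worker = threading.Thread(target=mapper.serve, args=(requests,))
worker.start()

# One partial result per partition, combined by the caller.
partial_sums = run_query(requests, lambda part: sum(v for _, v in part))
total = sum(partial_sums)

requests.put(None)  # ask the mapper to shut down
worker.join()
```

In a real deployment the queue would presumably be replaced by whatever messaging the framework already uses between processes; the point of the sketch is only that the partitioning step is unchanged and the exit is replaced by a wait loop.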

What's left is to figure out how to express the functions to be executed against the data (which may be in any format) in a consistent way. Most of them are aggregation functions:

  • sum
  • group?
  • etc
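One possible way to express such functions consistently, sketched under assumptions: each aggregation is described by three callables (`zero`, `step`, `merge`), so it can run per partition and the partial results can be combined afterwards. The `Aggregate` type and the `run`, `sum_of` and `group_by` helpers are hypothetical names, and records are assumed to be plain dicts here even though the real data may be in any format.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Aggregate:
    zero: Callable[[], Any]           # fresh accumulator
    step: Callable[[Any, Any], Any]   # fold one record into an accumulator
    merge: Callable[[Any, Any], Any]  # combine two accumulators


def run(agg, partitions):
    """Aggregate each partition on its own, then merge the partial results."""
    partials = []
    for part in partitions:
        acc = agg.zero()
        for record in part:
            acc = agg.step(acc, record)
        partials.append(acc)
    result = agg.zero()
    for partial in partials:
        result = agg.merge(result, partial)
    return result


def sum_of(field):
    """Sum the value that `field` extracts from each record."""
    return Aggregate(
        zero=lambda: 0,
        step=lambda acc, record: acc + field(record),
        merge=lambda a, b: a + b,
    )


def group_by(key, inner):
    """Apply the `inner` aggregate separately for each key."""
    def step(acc, record):
        k = key(record)
        acc[k] = inner.step(acc.get(k, inner.zero()), record)
        return acc

    def merge(a, b):
        for k, v in b.items():
            a[k] = inner.merge(a[k], v) if k in a else v
        return a

    return Aggregate(zero=dict, step=step, merge=merge)


partitions = [
    [{"city": "A", "n": 1}, {"city": "B", "n": 2}],
    [{"city": "A", "n": 3}],
]
total = run(sum_of(lambda r: r["n"]), partitions)
per_city = run(group_by(lambda r: r["city"], sum_of(lambda r: r["n"])), partitions)
```

Because the field and key extractors are ordinary callables, the same `Aggregate` could be reused over other record formats by swapping the extractor rather than the aggregation.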

Joins are a lot harder to achieve. Maybe the mapper/reducer process itself can implement specific functions that dictate how this is done, so that the framework doesn't become overly generic and hard to read.
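As a rough illustration of that idea, the sketch below lets a process define its own `join` method and has the framework only dispatch to it. All names (`OrdersMapper`, `join`, `framework_join`, the `customer_id` and `id` fields) are hypothetical, and the hash join shown is just one strategy a process might choose.

```python
class OrdersMapper:
    """Hypothetical resident mapper that knows how to join its own data."""

    def __init__(self, rows):
        # Index the in-memory rows by the join key once, up front.
        self.index = {}
        for row in rows:
            self.index.setdefault(row["customer_id"], []).append(row)

    def join(self, other_rows):
        # Process-specific join: a hash join on customer_id == id.
        for other in other_rows:
            for row in self.index.get(other["id"], []):
                yield {**other, **row}


def framework_join(process, other_rows):
    """The framework stays generic: it only calls the process's own join."""
    join = getattr(process, "join", None)
    if join is None:
        raise NotImplementedError("this process does not define a join")
    return list(join(other_rows))


orders = OrdersMapper([
    {"customer_id": 1, "total": 10},
    {"customer_id": 2, "total": 5},
    {"customer_id": 1, "total": 7},
])
joined = framework_join(orders, [{"id": 1, "name": "Ann"}])
```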
