jordan-heemskerk
commented
Jul 25, 2016
- Cleaned out a bunch of stuff that shouldn't be in the repo
- Restructured wc.go to work with Hadoop streaming, first step in getting it to go on EMR
- Restructure graphbuilder.go to work with new wc.go outputs
- Make it work with Hadoop streaming!
  - Need to be able to read/write using STDIN and STDOUT
  - Map and Reduce should be controlled by cmdline options
- I forked the Trie repo and added the features for us :)
- AWS EMR barfs if the output location already exists, so we need a unique one every time
@eburdon this too!
ec01ea0 is the crowning achievement... see this fvbock/trie#2
😮 You contributed to open source?! NIIIIIIIIIIIIICE Just looking through the files now... I was thinking that once this is stable, we'd just fire up the smallest EC2 cluster and run EMR on that instead of spot instances until the 12th. Shouldn't be too expensive and would prevent Lambda from having to configure every time. Thoughts?
My thoughts exactly. We can spin up a small EMR cluster and leave it running. Lambda can just submit jobs to it using the API (available for most major languages) and then fetch the results from S3 when they are available.
👍 Just for deleting all the junk alone... I packaged the existing repo just for safety's sake. Otherwise, looks great, and the plan sounds solid! Merge when ready.
To confirm, looks like there's 1 input, 1 output now?
I don't have permission to merge in this repo I don't think. You can just do it if you want, or gimme god mode :P.
Input and output all happen over STDIN and STDOUT now, as required by Hadoop streaming. There is one command-line input, and it controls whether the execution is mapping or reducing.
haha ok. Alex can get the command needed to run from Lambda from this codebase / readme?
@eburdon is he in this repo yet? Have him ask me, it's going to depend on what he is calling it from