Skip to content
jashkenas edited this page Sep 13, 2010 · 9 revisions

After writing a CloudCrowd::Action and installing it into the actions folder, CloudCrowd will be ready to run your own custom jobs. A minimal action consists of a single method, process, which defines the parallel part of the computation.

Optionally, actions may define a split method, which, running before process, splits up a single input into multiple inputs to be processed in parallel. All of the inputs to a job are already running in parallel in the first place, so defining a split method simply multiplies the potential parallelism of your job by a certain factor. Actions may also define a merge method, which receive all of the outputs of process in order to derive a single result.

An example of an action which employs all three stages is process_pdfs, included by default. It splits a single PDF input into smaller 10-page chunks, processes each page into a series of scaled images, as well as extracting the full text for that page, and then, when complete, merges all of the resulting files back together into a zipped-up directory, ready for download and import.

Split / Process / Merge

Actions are simple Ruby classes that inherit from CloudCrowd::Action, and implement at least a process method, for running the parallel portion of a task.

split optional A method that uses the input — the input being anything JSON-serializable, or a URL to a file — and splits it up into work units suitable for parallel processing. Return an array of output from this method, and each element in the array will be sent as the input to the next step (process), in parallel across all your workers.
process mandatory Perform the parallel portion of the computation. The input comes directly from your Job request, or, if you have a split method defined, from the output of split. The return value from process is either returned as the output of the job, or sent to merge for further processing.
merge optional The input is an unordered array of all the outputs from the process stage. The return value of merge is the result of the Job.

The Action API

Inheriting from CloudCrowd::Action also provides a handful of convenience methods:

input The input value to this stage of the action. In split and process, this is usually the URL to the file that needs to be processed, but it can be anything JSON-serializable. In merge, this is an array containing every result that was returned from all of the workers processing (which are also often, but not necessarily, URLs).
options As part of creating a job, you can specify an options configuration hash of arbitrary JSON for internal use. This options hash is passed through verbatim to each stage and every work unit.
download(url,path) A convenience method to download the contents of a given URL to local storage, specified by a relative path.
save(path) Saves the temporary file at the given path to a permanent location, using the AssetStore (S3 in production). This method will return a publicly accessible or authenticated URL to the file, which is usually what you’ll want to output.
input_path If the input is a URL, it will automatically be downloaded for you to local storage before the action begins. input_path gives you the local path to the input file.

In addition, all actions run within a work-unit-specific temporary directory, so relative paths work great for saving and loading scratch files, and all your scratch work will be automatically deleted at the end of the action.

Counting Words (MapReduce 101)

The “Hello World” of MapReduce is to perform a distributed word count. So let’s implement a word_count action in what will be essentially two lines of code. We define process to use the UNIX wc utility to count the number of words in the input file (we’d prefer not to read the file into memory). We define merge to sum the total resulting word counts across all inputs. The complete class reads like this:

class WordCount < CloudCrowd::Action

  def process
    (`wc -w #{input_path}`).match(/\A\s*(\d+)/)[1].to_i
  end

  def merge
    input.inject(0) {|sum, count| sum + count }
  end

end

You can see how WordCount’s process returns the numeric count, and merge receives as its input the array of resulting counts. The input to process was the URL to a text file, so the file had been automatically downloaded before process was called, and made available through input_path.

There are many more examples available, both in the example configuration folder that crowd install provides, as well as in the Gallery of Actions.