Working With Lambda Functions
Several Spark APIs take in lambda (also called anonymous) functions as arguments, which will get executed on the Spark worker nodes.
Let's say we want to filter out a specific word from an RDD, but the word can't be hardcoded in the lambda function. For this case, EclairJS provides an additional argument called bindArgs: an array of values that is appended to the arguments of the lambda function:
var wordToFilterOut = 'foo';
var filteredRDD = rdd.filter(function(word, wordToFilterOut) {
return word.trim() !== wordToFilterOut;
}, [wordToFilterOut]);
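To make the mechanism concrete, here is a small standalone sketch of how bound values end up in the lambda's argument list. This is plain JavaScript for illustration only, not the actual EclairJS implementation; the applyLambda helper is hypothetical:

```javascript
// Hypothetical helper (not EclairJS internals): bindArgs values are
// appended after the element when the lambda is invoked on a worker.
function applyLambda(lambda, element, bindArgs) {
  return lambda.apply(null, [element].concat(bindArgs));
}

var wordToFilterOut = 'foo';
var words = ['foo', 'bar', ' foo ', 'baz'];

// Same predicate shape as the filter example above, applied locally.
var kept = words.filter(function (word) {
  return applyLambda(function (w, filterWord) {
    return w.trim() !== filterWord;
  }, word, [wordToFilterOut]);
});
// kept is ['bar', 'baz']
```

The key point is that the serialized lambda itself carries no free variables; everything it needs beyond the RDD element arrives through the appended bindArgs.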
As of EclairJS version 0.4, if a lambda function needs to create a Spark class instance, that Spark class must be passed in via the bindArgs array. For example, in RDD.mapToPair, to generate a pair we need to create a Tuple2 like this:
var eclairjs = require('eclairjs');
var spark = new eclairjs();
...
var pairRDD = rdd.mapToPair(function(word, Tuple2) {
return new Tuple2(word.toLowerCase(), 1);
}, [spark.Tuple2]);
This is done for performance reasons: instead of shipping every Spark class to be loaded on the workers, we send only what is needed, reducing both the amount of data transferred and the amount of data loaded into memory.
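The class-binding pattern can also be sketched outside of Spark. Below, Tuple2 is a hypothetical minimal stand-in for spark.Tuple2, and mapToPair is a simplified local version of the real API, shown only to illustrate how the bound class reaches the lambda:

```javascript
// Hypothetical stand-in for spark.Tuple2, for illustration only.
function Tuple2(first, second) {
  this._1 = first;
  this._2 = second;
}

// Simplified local mapToPair: the bound class is appended to the
// lambda's arguments, so the lambda has no free references to resolve.
function mapToPair(elements, lambda, bindArgs) {
  return elements.map(function (el) {
    return lambda.apply(null, [el].concat(bindArgs));
  });
}

var pairs = mapToPair(['Foo', 'Bar'], function (word, Tuple2) {
  return new Tuple2(word.toLowerCase(), 1);
}, [Tuple2]);
// pairs[0]._1 is 'foo' and pairs[0]._2 is 1
```

Because the lambda receives Tuple2 as an argument rather than closing over it, only that one class (not the whole Spark class set) has to travel with the function.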