Developing With EclairJS
EclairJS Node provides the Apache Spark API with some minor differences. It's a work in progress, so please check our API docs to see which APIs are currently implemented.
Let's look at the following example of a simple Apache Spark program, where we parallelize an array of numbers and then collect them.
var eclairjs = require('eclairjs');
var spark = new eclairjs();
var sc = new spark.SparkContext("local[*]", "Simple Spark Program");
var rdd = sc.parallelize([1.10, 2.2, 3.3, 4.4]);
rdd.collect().then(function(results) {
    console.log("results: ", results);
    sc.stop();
}).catch(function(err) {
    console.error(err);
    sc.stop();
});
The eclairjs object provides the Apache Spark API (such as SparkContext).
As of version 0.7, EclairJS allows multiple Spark jobs to run concurrently, which required a change in the API. Previously, you could simply require the EclairJS module:
var eclairjs = require('eclairjs');
var sc = new eclairjs.SparkContext("local[*]", "Simple Spark Program");
In 0.7 and later, you must create a new instance of EclairJS from the required module:
var eclairjs = require('eclairjs');
var spark = new eclairjs();
var sc = new spark.SparkContext("local[*]", "Simple Spark Program");
Methods that return a single Spark object return their result immediately. For example, parallelize returns the RDD object directly.
var rdd = sc.parallelize([1.10, 2.2, 3.3, 4.4]);
Methods that return an array of Spark objects (such as RDD.randomSplit) or a native JavaScript value (like collect) return a Promise instead.
rdd.collect().then(function(results) {
    ...
}).catch(function(err) {
    ...
});
This is done because these calls can take a while to return a result, and blocking the Node.js event loop would stall the entire application.
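To make the two return styles concrete, here is a minimal plain-JavaScript mock (not the real EclairJS client) that mirrors the pattern: methods yielding Spark objects return them synchronously, while methods yielding native JavaScript values deliver them through a Promise. MockRDD and its methods are illustrative names, not EclairJS APIs.

```javascript
// Illustrative mock (not the real EclairJS API) showing the two return styles.
function MockRDD(data) {
    this.data = data;
}

// Returns a Spark-object result: the new RDD is handed back directly.
MockRDD.prototype.map = function(f) {
    return new MockRDD(this.data.map(f));
};

// Returns native JavaScript values, so the result arrives via a Promise,
// just as collect() does in EclairJS.
MockRDD.prototype.collect = function() {
    var data = this.data;
    return new Promise(function(resolve) {
        setImmediate(function() {
            resolve(data.slice());
        });
    });
};

var rdd = new MockRDD([1, 2, 3]);
var doubled = rdd.map(function(x) { return x * 2; }); // direct return

doubled.collect().then(function(results) {
    console.log("results: ", results); // the doubled values, asynchronously
});
```

The same shape applies to the real API: you can chain transformations synchronously, and only attach `.then()`/`.catch()` where a native value crosses back into Node.js.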
It is important to stop the SparkContext when your application is done. The same applies to StreamingContext.
rdd7.take(10).then(function(val) {
    ...
    sc.stop();
}).catch(function(err) {
    ...
    sc.stop();
});
This is because the actual Apache Spark program runs remotely, and you should always clean up after yourself. Also, EclairJS will not stop the SparkContext when your Node application exits, so consider handling that yourself, especially if you have a long-running Apache Spark program using Streaming.
function exit() {
    process.exit(0);
}

function stop(e) {
    if (e) {
        console.log('Error:', e);
    }
    if (sc) {
        sc.stop().then(exit).catch(exit);
    }
}

process.on('SIGTERM', stop);
process.on('SIGINT', stop);
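Because a stop routine like the one above can be reached from several paths (normal completion, an error handler, and each signal), one way to guard against stopping the SparkContext twice is a small "once" wrapper. This is an illustrative plain-JavaScript pattern, not part of the EclairJS API:

```javascript
// Illustrative pattern (not an EclairJS API): wrap a cleanup function so
// that repeated exit paths invoke it at most once.
function once(fn) {
    var called = false;
    return function() {
        if (!called) {
            called = true;
            fn.apply(null, arguments);
        }
    };
}

// In a real program the wrapped function would call sc.stop() and then exit.
var stopOnce = once(function() {
    console.log('stopping SparkContext');
});

process.on('SIGTERM', stopOnce);
process.on('SIGINT', stopOnce);
```

With this guard, you can safely call the same cleanup function from your Promise chain's `.then()` and `.catch()` handlers as well as from the signal handlers.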
We have a separate page about Lambda functions.
We have a separate page devoted to debugging.