Developing With EclairJS

EclairJS Node provides the Apache Spark API with some minor differences. It's a work in progress, so please check our API docs to see which APIs are currently implemented.

Let's look at the following example of a simple Apache Spark program that parallelizes an array of numbers and then collects them.

var eclairjs = require('eclairjs');

var spark = new eclairjs();

var sc = new spark.SparkContext("local[*]", "Simple Spark Program");

var rdd = sc.parallelize([1.10, 2.2, 3.3, 4.4]);

rdd.collect().then(function(results) {
  console.log("results: ", results);
  sc.stop();
}).catch(function(err) {
  console.error(err);
  sc.stop();
});

The eclairjs object provides access to the Apache Spark API (for example, SparkContext).

Creating a new Spark Context

As of version 0.7, EclairJS allows multiple Spark jobs to run concurrently, which required a change in the API. Previously you could just require the EclairJS module with:

var eclairjs = require('eclairjs');

var sc = new eclairjs.SparkContext("local[*]", "Simple Spark Program");

In 0.7 and later you have to create a new instance of EclairJS from the required module:

var eclairjs = require('eclairjs');

var spark = new eclairjs();

var sc = new spark.SparkContext("local[*]", "Simple Spark Program");
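
For example, because each EclairJS instance gets its own SparkContext, two jobs can now run side by side in the same Node process. The following is a minimal sketch (the application names and job bodies are placeholders):

var eclairjs = require('eclairjs');

var spark1 = new eclairjs();
var spark2 = new eclairjs();

// Each EclairJS instance owns its own SparkContext, so the two jobs run independently.
var sc1 = new spark1.SparkContext("local[*]", "First Job");
var sc2 = new spark2.SparkContext("local[*]", "Second Job");

sc1.parallelize([1, 2, 3]).collect().then(function(results) {
  console.log("first job: ", results);
  sc1.stop();
}).catch(function(err) {
  console.error(err);
  sc1.stop();
});

sc2.parallelize([4, 5, 6]).collect().then(function(results) {
  console.log("second job: ", results);
  sc2.stop();
}).catch(function(err) {
  console.error(err);
  sc2.stop();
});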

Method Calls

Methods that return a single Spark object return their result immediately. For example, parallelize returns the RDD object directly.

var rdd = sc.parallelize([1.10, 2.2, 3.3, 4.4]);

Methods that return an array of Spark objects (such as RDD.randomSplit) or a native JavaScript object (such as collect) return a Promise.

rdd.collect().then(function(results) {
   ...
}).catch(function(err) {
   ...
});

This is done because these calls can take a while to return a result, and Node.js should not have its execution blocked while waiting.
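
For example, RDD.randomSplit produces multiple RDDs, so its result arrives through a Promise. This is a minimal sketch assuming the standard Spark signature, where randomSplit takes an array of weights:

// randomSplit returns a Promise because it resolves to an array of RDD objects.
rdd.randomSplit([0.5, 0.5]).then(function(rdds) {
  // rdds[0] and rdds[1] can be used like any other RDD.
  return rdds[0].collect();
}).then(function(results) {
  console.log("first split: ", results);
}).catch(function(err) {
  console.error(err);
});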

Stopping the SparkContext

It is important to stop the SparkContext when your application is done. The same applies to StreamingContext.

rdd7.take(10).then(function(val) {
  ...
  sc.stop();

}).catch(function(err) {
  ...
  sc.stop();
});

This is because the actual Apache Spark program is running remotely and you always want to clean up after yourself. Also, EclairJS will not stop the SparkContext when your Node application exits, so consider handling that yourself, especially if you have a long-running Apache Spark program that uses Streaming.

function exit() {
  process.exit(0);
}

function stop(e) {
  if (e) {
    console.log('Error:', e);
  }

  // Stop the remote SparkContext, then exit the Node process either way.
  if (sc) {
    sc.stop().then(exit).catch(exit);
  }
}

// Clean up the SparkContext when the process is asked to terminate.
process.on('SIGTERM', stop);
process.on('SIGINT', stop);

Lambda Functions

We have a separate page about Lambda functions.
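
As a quick taste, transformations such as map accept a plain JavaScript function that EclairJS ships to Spark for execution. This is a minimal sketch; see the Lambda functions page for the rules on what these functions may reference:

var rdd = sc.parallelize([1, 2, 3, 4]);

// The function passed to map is serialized and run remotely by Spark.
var doubled = rdd.map(function(num) {
  return num * 2;
});

doubled.collect().then(function(results) {
  console.log("doubled: ", results);
  sc.stop();
}).catch(function(err) {
  console.error(err);
  sc.stop();
});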

Error Handling and Debugging

We have a separate page devoted to debugging.