Setting up IPython Notebook with PySpark

These are brief notes on setting up an environment for running PySpark via IPython Notebook with Spark v1.4.1. The steps cover running standalone Spark on a single node. If you wish to run on Amazon clusters, you can use Spark's ec2 script to launch on Amazon EC2, or create an EMR job directly and select Spark as an add-on via the AWS web console.


  • Java 1.7 or greater
  • Maven or Simple build tool (sbt)
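Before building, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch, assuming Python 3.3+ for `shutil.which`; the helper name `find_build_tools` is made up for illustration, and the check only locates the executables without verifying the Java version.

```python
import shutil


def find_build_tools(names=("java", "mvn", "sbt")):
    """Map each tool name to its location on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_build_tools().items():
        print("%-5s %s" % (name, path or "not found on PATH"))
```

Only one of `mvn` or `sbt` needs to be present, so a single "not found" line for one of them is not necessarily a problem.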

Install Spark

Download the latest Spark release (1.4.1 as of July 2015). I selected spark-1.4.1.tar.

Extract the tar file:

tar -xvf spark-1.4.1.tar

Build Spark from source with Maven. Alternatively, you can download the pre-built version and skip this step.

mvn clean package -DskipTests

Set the "SPARK_HOME" environment variable. E.g., in Unix environments, add the following to ~/.bash_profile:

export SPARK_HOME=<location of the install>
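Once the variable is exported and a new shell is opened, a quick sanity check can catch a mistyped path early. This is a rough sketch rather than part of the original setup: the function name `check_spark_home` is hypothetical, and it assumes the Spark distribution keeps its launcher at `bin/pyspark` under `SPARK_HOME`.

```python
import os


def check_spark_home(environ=None):
    """Return a list of problems found with SPARK_HOME (empty if none)."""
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    problems = []
    if not os.path.isdir(spark_home):
        problems.append("SPARK_HOME is not a directory: " + spark_home)
    elif not os.path.isfile(os.path.join(spark_home, "bin", "pyspark")):
        # Assumption: the distribution ships its launcher as bin/pyspark.
        problems.append("bin/pyspark not found under " + spark_home)
    return problems


if __name__ == "__main__":
    for message in check_spark_home() or ["SPARK_HOME looks fine"]:
        print(message)
```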

Verify that pyspark is installed correctly:

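cd <spark-distro-directory>
./bin/pyspark

import math
testRdd = sc.parallelize([4,16,9])

#verify results (assumed check: square roots, expect [2.0, 4.0, 3.0])
testRdd.map(math.sqrt).collect()

#to quit
exit()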

Install Python

Install Anaconda

Download Anaconda, which includes Python 2.7 and the main scientific libraries. Then update conda and the IPython packages:

conda update conda
conda update ipython ipython-notebook ipython-qtconsole

Anaconda comes with the free Spyder IDE. There are other free IDEs and text editors, such as Sublime and Emacs. There are also non-free ones, such as PyCharm by JetBrains.

Run PySpark from IPython notebook

IPython Notebook is somewhat similar to Mathematica. It's a web app that lets you combine descriptions, images, and visualizations with executable code.

Download the following Python setup script from GitHub to create a new pyspark profile for running IPython Notebook, then run it.
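For orientation, here is a minimal sketch of one thing a profile startup script of this kind needs to do: make the `pyspark` package that ships with Spark importable. This is not the script referred to above. The function names are hypothetical, and it assumes the Spark distribution keeps its Python sources under `python/` with a bundled `py4j-*-src.zip` under `python/lib/`.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the sys.path entries needed for `import pyspark` to work."""
    python_dir = os.path.join(spark_home, "python")
    # Assumption: py4j ships as a source zip under python/lib; match any version.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_path(spark_home):
    """Prepend the Spark Python paths to sys.path and return those newly added."""
    added = [p for p in spark_python_paths(spark_home) if p not in sys.path]
    sys.path[:0] = added
    return added


spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    print("added to sys.path:", add_spark_to_path(spark_home))
else:
    print("SPARK_HOME is not set; nothing added")
```

Creating the `sc` SparkContext used in the test cells is left to the actual setup script; this sketch only covers the import path.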


Start ipython notebook from terminal

ipython notebook

Open your browser and navigate to the address printed in the terminal to view the IPython notebooks.


If you would like to change to a different port, modify the following lines in the script:

ip = '*' # Warning: this is potentially insecure
port = <new-port>
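Before picking a new port, you may want to confirm that nothing else is already listening on it. The following is a small sketch using only the standard library, assuming Python 3; `port_is_free` is a made-up helper name, and 8888 is used purely as an example value.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": something else holds the port.
            return False
        return True


if __name__ == "__main__":
    candidate = 8888  # example value only; substitute the port you plan to use
    print(candidate, "is free" if port_is_free(candidate) else "is in use")
```

The check is only a snapshot: another process could still claim the port between the check and the notebook server starting.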

Create a new notebook by

  • New -> Python (2 or 3)
  • Press + to create a new cell
  • Press the play icon to run (or Ctrl+Enter)

Test that PySpark is working from IPython Notebook by pasting the following into a cell and hitting run:

import math
testRdd = sc.parallelize([16,16,9])
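A slightly more defensive variant of such a test cell is sketched below. It assumes the pyspark profile defines a SparkContext named `sc`, and it assumes the intended check is taking square roots (suggested by the `math` import, but not stated). If `sc` is missing, it reports that instead of raising a `NameError`.

```python
import math


def local_sqrt(values):
    """Compute the expected result in plain Python, without Spark."""
    return [math.sqrt(v) for v in values]


data = [16, 16, 9]
expected = local_sqrt(data)

try:
    sc  # assumed to be defined by the pyspark profile
except NameError:
    print("No SparkContext named 'sc' found; is the pyspark profile loaded?")
else:
    # Run the same computation through Spark and compare with the local result.
    result = sc.parallelize(data).map(lambda x: math.sqrt(x)).collect()
    print("Spark result matches local result:", result == expected)
```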