PySpark Workshop

This repository contains the materials for the PySpark workshop at AMLD2019.

Part 1: PySpark for Big Data Processing

1.1 Installation:

1.1.1 Method 1: Running PySpark locally (e.g. on your laptop)

macOS or Linux:

See INSTALLATION_UNIX.md in the docs folder.

Windows:

See INSTALLATION_WINDOWS.md in the docs folder.

1.1.2 Method 2: Running PySpark on Google Colab

See GOOGLECOLAB_README.md in the docs folder.
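
Whichever method you choose, it can help to confirm that the Python interpreter behind your notebook can find the pyspark package before opening the workshop notebooks. The snippet below is a minimal sketch of such a check. The function name pyspark_available is a hypothetical helper introduced here for illustration and is not part of the workshop materials. It uses only the Python standard library and does not start Spark.

```python
import importlib.util
import sys


def pyspark_available():
    """Return True if this interpreter can locate the pyspark package.

    Hypothetical helper for illustration; it only looks the package up
    and does not import it or start a Spark session.
    """
    return importlib.util.find_spec("pyspark") is not None


# Report which interpreter is running and whether pyspark is visible to it.
print("Python version:", sys.version.split()[0])
print("pyspark importable:", pyspark_available())
```

If the second line reports False, revisit the installation guide for your platform in the docs folder.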

1.2 Agenda:

1.2.1 Data processing in PySpark

If you run PySpark on your laptop, start with the notebook data_processing_start.ipynb in the src/local folder. The completed notebook data_processing_end.ipynb is in the same folder. It is highly recommended that you always start with the empty notebooks (_start.ipynb) and write the code from scratch rather than copy-pasting cells from the completed notebook.

If you run PySpark on Google Colab, start with the notebook data_processing_gc_start.ipynb in the src/google_colab folder. The completed notebook data_processing_gc_end.ipynb is in the same folder.

1.2.2 Machine learning in PySpark (MLlib)

If you run PySpark on your laptop, start with the notebook spark_mllib_start.ipynb in the src/local folder. The completed notebook spark_mllib_end.ipynb is in the same folder.

If you run PySpark on Google Colab, start with the notebook spark_mllib_gc_start.ipynb in the src/google_colab folder. The completed notebook spark_mllib_gc_end.ipynb is in the same folder.
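
The notebook locations in sections 1.2.1 and 1.2.2 follow a regular naming pattern: a topic name, an optional _gc suffix for Google Colab, and _start or _end. As a rough sketch, the pattern can be written as a small lookup. The function notebook_path and the constants ENVIRONMENTS and TOPICS are hypothetical names introduced here for illustration. Only the folder and file names come from this README.

```python
from pathlib import PurePosixPath

# Folder and file-name suffix per environment, as listed in the agenda.
ENVIRONMENTS = {
    "local": ("src/local", ""),
    "google_colab": ("src/google_colab", "_gc"),
}

# Notebook topics covered in sections 1.2.1 and 1.2.2.
TOPICS = ("data_processing", "spark_mllib")


def notebook_path(topic, environment, completed=False):
    """Build the repository-relative path of a workshop notebook.

    Hypothetical helper: it only assembles the file name and does not
    check that the file exists on disk.
    """
    if topic not in TOPICS:
        raise ValueError(f"unknown topic: {topic!r}")
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    folder, suffix = ENVIRONMENTS[environment]
    stage = "end" if completed else "start"
    return str(PurePosixPath(folder) / f"{topic}{suffix}_{stage}.ipynb")


# Starting notebook for data processing on a laptop.
print(notebook_path("data_processing", "local"))
# Completed MLlib notebook for Google Colab.
print(notebook_path("spark_mllib", "google_colab", completed=True))
```

The default is completed=False, in line with the recommendation to begin from the empty _start notebooks.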

Part 2: Running PySpark in Jupyter Notebook on Amazon Clusters

See AWS_README.md in the docs folder.

Repository: hamedrazavi/pyspark_amld2019
