This repository contains the materials for the PySpark workshop at AMLD2019.
For installation instructions on Unix, see INSTALLATION_UNIX.md in the docs folder.
For installation instructions on Windows, see INSTALLATION_WINDOWS.md in the docs folder.
To use Google Colab, see GOOGLECOLAB_README.md in the docs folder.
For the data processing part, if you run PySpark on your laptop, start with the notebook data_processing_start.ipynb in the src/local folder. The completed notebook data_processing_end.ipynb is in the same folder. It is highly recommended that you always start with the empty notebooks (_start.ipynb) and avoid copy-pasting cells from the completed notebooks. Writing the code from scratch is good practice!
For the data processing part, if you run PySpark on Google Colab, start with the notebook data_processing_gc_start.ipynb in the src/google_colab folder. The completed notebook data_processing_gc_end.ipynb is in the same folder.
For the Spark MLlib part, if you run PySpark on your laptop, start with the notebook spark_mllib_start.ipynb in the src/local folder. The completed notebook spark_mllib_end.ipynb is in the same folder.
For the Spark MLlib part, if you run PySpark on Google Colab, start with the notebook spark_mllib_gc_start.ipynb in the src/google_colab folder. The completed notebook spark_mllib_gc_end.ipynb is in the same folder.
To use AWS, see AWS_README.md in the docs folder.