This project focuses on parallelising pre-processing, measurement and machine learning in the cloud, and on evaluating and analysing cloud performance.
Dataset - The public “Flowers” dataset (3600 images, 5 classes) is used for the analysis.
About the project - An in-depth analysis of how parallelisation affects the performance of various cluster configurations (GCP's Dataproc) in terms of CPU/memory utilisation, disk I/O operations and network bandwidth. The project also experiments with different VM configurations and distribution strategies to identify the most efficient combination for training the ML model on Google Cloud's AI Platform.
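As a rough illustration of this kind of measurement, the sketch below times a hypothetical per-image pre-processing step at several degrees of parallelism and derives speedup and efficiency from the wall-clock times. It is a local, thread-based stand-in and not the notebook's Dataproc code. The names `preprocess`, `timed_run` and `speedup_and_efficiency` are assumptions made for this example, as is the simulated latency. It covers wall-clock time only, not CPU/memory, disk or network metrics.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def preprocess(image_id: int) -> int:
    """Hypothetical stand-in for per-image work (decode, resize, ...)."""
    time.sleep(0.005)  # simulated I/O latency; an assumption, not a measured value
    return image_id * 2


def timed_run(image_ids, num_workers: int):
    """Process all items with `num_workers` threads; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(preprocess, image_ids))  # map preserves input order
    return results, time.perf_counter() - start


def speedup_and_efficiency(baseline_s: float, parallel_s: float, num_workers: int):
    """Speedup = T_1 / T_n; efficiency = speedup / n."""
    speedup = baseline_s / parallel_s
    return speedup, speedup / num_workers


if __name__ == "__main__":
    ids = list(range(40))
    _, t1 = timed_run(ids, 1)  # single-worker baseline
    for n in (2, 4, 8):
        _, tn = timed_run(ids, n)
        s, e = speedup_and_efficiency(t1, tn, n)
        print(f"workers={n} time={tn:.3f}s speedup={s:.2f} efficiency={e:.2f}")
```

Efficiency below 1.0 indicates overhead from coordination or contention. On a real cluster the same comparison would be made across machine counts and types rather than local threads.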
Repository Contents - BigData_ML_Model_on_Google_Cloud_Platform.ipynb - Google Colaboratory file containing the code.