Cloud-Based Data Analytics Assignments

Overview

This repository contains solutions for assignments related to cloud-based data analytics, covering technologies such as Hadoop MapReduce, Apache Spark, Azure Cloud Platform, and Machine Learning using Azure ML. The assignments were completed as part of the MIE1628 - Cloud-Based Data Analytics course at the University of Toronto.

Assignments

Assignment 1: Hadoop MapReduce

Topic: Implementation of Line Counting and K-Means Clustering using Hadoop MapReduce.
Files:
- Assignment1_Solution.pdf
Key Concepts:
- Line Count Program Implementation
- K-Means Clustering on MapReduce
- Canopy Selection for K-Means Optimization
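
As a rough illustration of how K-Means fits the MapReduce model, the sketch below runs one iteration in plain Python. It is not taken from Assignment1_Solution.pdf: the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical, the in-memory dictionary only stands in for Hadoop's shuffle phase, and canopy selection is not shown.

```python
from collections import defaultdict
from math import dist


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def kmeans_reduce(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map -> shuffle -> reduce pass and return the updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle phase (assumption)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [
        kmeans_reduce(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]
```

Calling `kmeans_iteration` repeatedly until the centroids stop moving gives the usual iterative algorithm; in a real Hadoop job each call would correspond to one MapReduce job.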

Assignment 2: Apache Spark

Topic: Data processing and a recommendation system using PySpark and Spark SQL.
Files:
- Assignment2_Solution.html
Key Concepts:
- Counting Odd/Even Numbers from a Dataset
- Salary Summation per Department
- Word Count and Frequency Analysis using PySpark
- Collaborative Filtering for Movie Recommendations
- Model Evaluation using RMSE & MAE
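
For readers unfamiliar with the two evaluation metrics listed above, here is a minimal plain-Python sketch of RMSE and MAE. The function names and the sample ratings are illustrative assumptions, not results from the assignment; in PySpark the same metrics are typically obtained from `pyspark.ml.evaluation.RegressionEvaluator` with `metricName` set to `"rmse"` or `"mae"`, though this README does not state which approach the solution uses.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error: average size of the error, ignoring sign."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical ratings, for illustration only.
actual_ratings = [4, 3, 5, 1]
predicted_ratings = [3, 3, 4, 3]
print(rmse(actual_ratings, predicted_ratings))
print(mae(actual_ratings, predicted_ratings))
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions; a large gap between the two suggests a few badly mispredicted ratings.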

Assignment 3: Spark and Cloud Data Platform

Topic: Intrusion Detection and Data Analysis using PySpark on Databricks.
Files:
- Assignment3_Solution.html
Key Concepts:
- Extracting and Processing KDD Cup 99 Data
- Feature Engineering & Exploratory Data Analysis
- Machine Learning Model for Intrusion Detection
- Cloud-based Data Processing
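
One common preprocessing step for KDD Cup 99 is turning the attack-type label into a binary target. The sketch below shows that idea in plain Python under stated assumptions: records are comma-separated with the label as the last field, raw labels may carry a trailing period (as in `normal.`), and the helper names are hypothetical. The sample records are shortened placeholders, not real rows, and the actual notebook may prepare the data differently.

```python
def split_record(line):
    """Split one comma-separated record into (feature fields, label)."""
    fields = line.strip().split(",")
    # Assumption: the label is the last field and may end with a period.
    label = fields[-1].rstrip(".")
    return fields[:-1], label


def is_attack(label):
    """Binary target: 0 for normal traffic, 1 for any attack type."""
    return 0 if label == "normal" else 1


# Shortened, made-up records for illustration (real rows have many more fields).
sample_lines = [
    "0,tcp,http,SF,normal.",
    "0,icmp,ecr_i,SF,smurf.",
]
for line in sample_lines:
    features, label = split_record(line)
    print(features, label, is_attack(label))
```

In PySpark the same logic would usually be expressed as a column transformation rather than a Python loop, so that it runs on the cluster.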

Assignment 4: Azure Cloud Platform

Topic: Working with Azure Data Factory, Azure SQL Database, and ADLS Gen2.
Files:
- Assignment4_Solution.html
Key Concepts:
- Data Pipelines with Azure Data Factory
- SQL Queries on Gender Jobs Data
- Setting Up Bi-Directional Data Replication

Assignment 5: Azure Machine Learning

Topic: Machine Learning using Azure ML Studio and Stream Analytics.
Files:
- Assignment5_Solution.ipynb
Key Concepts:
- Stream Analytics with IoT Data Processing
- Data Exploration and Preprocessing
- Machine Learning Model Training and Evaluation
- Automated ML and Hyperparameter Tuning

Setup and Usage

Clone the repository:

git clone https://github.com/manish-kotra/Azure-Projects.git
cd Azure-Projects

Open relevant files:
- .html files can be viewed in any web browser; .pdf files can be viewed in a browser or a PDF reader.
- .ipynb files should be opened in Jupyter Notebook or Azure ML Studio.

Technologies Used

Big Data Processing: Hadoop, Apache Spark, PySpark
Cloud Platforms: Azure Data Factory, Azure SQL DB, ADLS Gen2
Machine Learning: Azure ML, Python, Scikit-Learn
Data Visualization & Analysis: Pandas, Matplotlib, SQL

License

This repository is for academic purposes only. Please do not plagiarize or distribute without permission.

Author

Manish Kumar - University of Toronto
