This repository contains solutions for assignments related to cloud-based data analytics, covering technologies such as Hadoop MapReduce, Apache Spark, Azure Cloud Platform, and Machine Learning using Azure ML. The assignments were completed as part of the MIE1628 - Cloud-Based Data Analytics course at the University of Toronto.
- Topic: Implementation of Line Counting and K-Means Clustering using Hadoop MapReduce.
- Files: Assignment1_Solution.pdf
- Key Concepts:
- Line Count Program Implementation
- K-Means Clustering on MapReduce
- Canopy Selection for K-Means Optimization
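
As a rough illustration of how K-Means maps onto MapReduce, the sketch below runs the two phases in plain Python: a map step that assigns each point to its nearest centroid, and a reduce step that averages the points in each group. This is not the code from Assignment1_Solution.pdf. The function names, the sample points, and the starting centroids are all assumptions made for the example, and canopy selection is not shown.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, centroids):
    """Emit (centroid_index, point) pairs, as a mapper would."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Group points by centroid index and average each group."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    # A centroid with no assigned points keeps its previous position.
    new_centroids = list(centroids)
    for idx, pts in groups.items():
        new_centroids[idx] = tuple(sum(coord) / len(pts) for coord in zip(*pts))
    return new_centroids


def kmeans(points, centroids, max_iter=20):
    """Repeat map and reduce until the centroids stop moving."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical sample data, chosen only to make the two clusters obvious.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
result = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(result)
```

In a real Hadoop job, each iteration would typically be a separate MapReduce job, with the current centroids distributed to every mapper.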
- Topic: Data processing and recommendation system using PySpark and Spark SQL.
- Files: Assignment2_Solution.html
- Key Concepts:
- Counting Odd/Even Numbers from a Dataset
- Salary Summation per Department
- Word Count and Frequency Analysis using PySpark
- Collaborative Filtering for Movie Recommendations
- Model Evaluation using RMSE & MAE
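
For readers unfamiliar with the two evaluation metrics listed above, here is a minimal plain-Python sketch of RMSE and MAE. It is not taken from Assignment2_Solution.html, and the rating values are made up for illustration. In PySpark, the same metrics are commonly obtained from `pyspark.ml.evaluation.RegressionEvaluator` with `metricName` set to `"rmse"` or `"mae"`.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of the error."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical movie ratings (actual) and model predictions.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.0, 3.0, 4.0, 4.0]

print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

RMSE is never smaller than MAE on the same data, so a large gap between the two suggests a few predictions are far off.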
- Topic: Intrusion Detection and Data Analysis using PySpark on Databricks.
- Files: Assignment3_Solution.html
- Key Concepts:
- Extracting and Processing KDD Cup 99 Data
- Feature Engineering & Exploratory Data Analysis
- Machine Learning Model for Intrusion Detection
- Cloud-based Data Processing
- Topic: Working with Azure Data Factory, Azure SQL Database, and ADLS Gen2.
- Files: Assignment4_Solution.html
- Key Concepts:
- Data Pipelines with Azure Data Factory
- SQL Queries on Gender Jobs Data
- Setting Up Bi-Directional Data Replication
- Topic: Machine Learning using Azure ML Studio and Stream Analytics.
- Files: Assignment5_Solution.ipynb
- Key Concepts:
- Stream Analytics with IoT Data Processing
- Data Exploration and Preprocessing
- Machine Learning Model Training and Evaluation
- Automated ML and Hyperparameter Tuning
- Clone the repository:
  git clone https://github.com/manish-kotra/Azure-Projects.git
  cd Azure-Projects
- Open relevant files:
  - .pdf and .html files can be viewed in a browser or a PDF reader.
  - .ipynb files should be opened in Jupyter Notebook or Azure ML Studio.
- Big Data Processing: Hadoop, Apache Spark, PySpark
- Cloud Platforms: Azure Data Factory, Azure SQL DB, ADLS Gen2
- Machine Learning: Azure ML, Python, Scikit-Learn
- Data Visualization & Analysis: Pandas, Matplotlib, SQL
This repository is for academic purposes only. Please do not plagiarize or distribute without permission.
Manish Kumar - University of Toronto