This project builds a data pipeline over the MovieLens 25M dataset using Hive and Python, and presents the results in a Power BI dashboard for analysis.
The objectives of this project are:
- Develop insights from a large dataset in a Cloudera VM
- Build and optimize HiveQL queries that run on large CSV files
- Enrich the base MovieLens dataset with TMDB data, using the TMDB API from Python
- Analyse the data and draw conclusions about the cinema industry
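
The TMDB enrichment step listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: it assumes the `links.csv` file shipped with MovieLens 25M (columns `movieId`, `imdbId`, `tmdbId`) and the TMDB v3 movie details endpoint. The function names, the choice of fields (`budget`, `revenue`, `runtime`) and the injectable `fetch` parameter are hypothetical and only there to keep the example small and testable.

```python
import csv
import json
import urllib.parse
import urllib.request

# TMDB v3 movie details endpoint; the API key is passed as a query parameter.
TMDB_BASE_URL = "https://api.themoviedb.org/3/movie"


def build_tmdb_url(tmdb_id, api_key):
    """Build the movie details URL for one TMDB id."""
    query = urllib.parse.urlencode({"api_key": api_key})
    return f"{TMDB_BASE_URL}/{tmdb_id}?{query}"


def fetch_json(url):
    """Download a URL and decode its JSON body (performs a real HTTP request)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def enrich_links(links_lines, api_key,
                 fields=("budget", "revenue", "runtime"),
                 fetch=fetch_json):
    """Yield one dict per row of links.csv, extended with selected TMDB fields.

    `links_lines` is any iterable of CSV lines (an open file works).
    `fields` is an assumed selection; adjust it to the columns you need.
    """
    for row in csv.DictReader(links_lines):
        enriched = {"movieId": row["movieId"], "tmdbId": row["tmdbId"]}
        # Rows without a TMDB id cannot be looked up, so their fields stay None.
        details = {}
        if row["tmdbId"]:
            details = fetch(build_tmdb_url(row["tmdbId"], api_key))
        for field in fields:
            enriched[field] = details.get(field)
        yield enriched
```

With a real API key, the generator can be fed an open file, for example `with open("links.csv", newline="") as f: rows = list(enrich_links(f, api_key))`, and the resulting rows written back to CSV for loading into Hive. A production version would also need rate limiting and error handling for failed requests, which are omitted here.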
All HiveQL queries are in the "Data Ingestion and Queries" folder, and the final Power BI dashboard is in the "PowerBI Dashboard" folder.