YouTube Data Analysis ETL Using AWS (S3, Glue, Lambda, Athena, CloudWatch, QuickSight)

This project creates a scalable data pipeline to analyze YouTube data from Kaggle using AWS services. It processes raw JSON and CSV files into cleansed, partitioned datasets, integrates them with ETL workflows, and catalogs data for querying. Final insights are visualized in QuickSight dashboards.

Dataset Description (Trending YouTube Video Statistics)

The YouTube Kaggle dataset contains raw JSON and CSV files with details about trending YouTube videos by region and category. It provides insights into video popularity, user engagement, and regional content trends, making it valuable for:
• Identifying popular categories and regions for targeted content creation.
• Analyzing audience engagement metrics like likes, dislikes, and comments.
• Tracking video performance trends over time.

Dataset Link: https://www.kaggle.com/datasets/datasnaek/youtube-new

Business Problem

This project addresses the following business questions:

What types of content perform best in specific regions?
Which video categories generate the highest engagement?
How does audience interaction (likes, dislikes, comments) correlate with views?
What keywords or topics drive video popularity?

AWS Services Used

Amazon S3: Data storage for raw, cleansed, and final datasets.
AWS Glue: ETL processes and data cataloging.
AWS Lambda: JSON data processing and array expansion.
Amazon Athena: Data querying and analysis.
Amazon QuickSight: Visualization and dashboard creation.
Amazon CloudWatch: Monitoring logs for Glue and Lambda.
IAM Roles: Secure access control for AWS services.

Architecture Diagram

Steps Implemented

Data Ingestion: Downloaded the Kaggle dataset, which consists of multiple CSV and JSON files, and bulk-uploaded the files to the raw S3 bucket using the AWS CLI.
Data Cataloging: Created a Glue crawler to catalog the raw data in S3, then queried the raw data with Athena for initial exploration.
Data Processing:

JSON Files: Developed a Lambda function to process the raw JSON files. It expands the nested arrays and stores the processed data in the cleansed S3 bucket in Parquet format. An S3 trigger invokes the Lambda function automatically whenever new files are uploaded.
CSV Files: Created a Glue ETL job to process the raw CSV files and store the processed data in the cleansed S3 bucket, partitioned by region.
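
The JSON step above can be sketched roughly as follows. This is a minimal illustration, not the code in lambda_function.py.py: it assumes the category files have a top-level `items` array whose elements carry a nested `snippet` object, and the function names, the underscore column naming, and the sample values are hypothetical. Reading from S3 and writing Parquet are left out so the sketch runs with the standard library alone.

```python
from urllib.parse import unquote_plus


def extract_s3_location(event):
    """Return (bucket, key) for the first record of an S3 event notification."""
    record = event["Records"][0]["s3"]
    # Object keys arrive URL-encoded in S3 event notifications.
    return record["bucket"]["name"], unquote_plus(record["object"]["key"])


def flatten_items(payload):
    """Expand the top-level "items" array into one flat dict per element."""
    rows = []
    for item in payload.get("items", []):
        row = {}
        for key, value in item.items():
            if isinstance(value, dict):
                # Promote one level of nesting: snippet.title -> snippet_title.
                for sub_key, sub_value in value.items():
                    row[f"{key}_{sub_key}"] = sub_value
            else:
                row[key] = value
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical event and payload, shaped like an S3 trigger and a category file.
    sample_event = {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": "example-raw-bucket"},
                    "object": {"key": "raw/region%3Dca/CA_category_id.json"},
                }
            }
        ]
    }
    sample_payload = {
        "items": [
            {"id": "1", "snippet": {"title": "Film & Animation", "assignable": True}}
        ]
    }
    print(extract_s3_location(sample_event))
    print(flatten_items(sample_payload))
```

In a deployed function, the flattened rows would be written to the cleansed bucket as Parquet. Keeping the flattening logic free of AWS calls makes it easy to test locally before attaching the S3 trigger.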

Data Integration: Developed a Glue ETL job to join the cleansed CSV and JSON data on id and category_id. Stored the integrated dataset in the final analytics S3 bucket in Parquet format and registered it in the Glue Data Catalog.
Data Visualization: Queried the final analytics data using Athena, then connected Athena to Amazon QuickSight to build dashboards and reports.
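
The join in the integration step can be illustrated with a small pure-Python sketch. It assumes that `category_id` comes from the video (CSV) rows and `id` from the category (JSON) rows, and that the two may differ in type (for example, integer versus string), so both sides are cast to `int` before matching. The function name and the other field names (`video_id`, `region`, `snippet_title`) are hypothetical, and the actual Glue job may implement the join differently.

```python
def join_videos_with_categories(videos, categories):
    """Inner-join video rows to category rows on category_id == id."""
    # Index categories by integer id; the cast guards against string ids.
    by_id = {int(category["id"]): category for category in categories}

    joined = []
    for video in videos:
        category = by_id.get(int(video["category_id"]))
        if category is None:
            # Inner join: drop videos that have no matching category.
            continue
        merged = dict(video)
        # Copy category columns except the join key, which duplicates category_id.
        # A category column with the same name as a video column would overwrite it.
        merged.update({k: v for k, v in category.items() if k != "id"})
        joined.append(merged)
    return joined


if __name__ == "__main__":
    # Hypothetical rows for illustration only.
    videos = [
        {"video_id": "a", "category_id": 1, "region": "ca"},
        {"video_id": "b", "category_id": 99, "region": "us"},
    ]
    categories = [{"id": "1", "snippet_title": "Film & Animation"}]
    for row in join_videos_with_categories(videos, categories):
        print(row)
```

Normalizing the key type before the join avoids silently empty results when one source stores the identifier as text and the other as a number.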


deept-agl/Youtube-data-ETL-Analysis-using-AWS
