Titanic Data Analysis with PySpark

Welcome to the Titanic EDA project! This repository contains a comprehensive Exploratory Data Analysis (EDA) of the famous Titanic dataset, leveraging the power of Apache Spark (PySpark) and Python's visualization libraries.


πŸš€ Overview

This notebook demonstrates a hybrid approach to data science, combining the scalability of PySpark for data processing with the rich visualization capabilities of Pandas, Seaborn, and Matplotlib.

Key Features:

  • Data Integration: Seamlessly transition between Pandas and Spark DataFrames.
  • Big Data Ready: Utilizes Spark SQL for efficient data aggregation and statistical summaries.
  • Interactive EDA: Insights into survival rates, demographics, and class distributions.
  • Visual Storytelling: High-quality plots that highlight critical trends in the Titanic disaster.

πŸ› οΈ Tech Stack


πŸ”„ Project Flow

The analysis follows a structured pipeline tailored for Big Data Exploration:

  1. Infrastructure Setup: Installing and configuring the PySpark environment within an IPython kernel.
  2. Spark Session Initialization: Building the entry point for Spark SQL and DataFrame API.
  3. Cross-Library Data Loading: Fetching the dataset via Pandas (for quick HTTP handling) and migrating it to the Spark SQL engine.
  4. Schema & Statistical Discovery: Deep-dive into data types, null counts, and distribution metrics (Mean, Standard Deviation).
  5. Multi-Dimensional Aggregation: Leveraging Spark's groupBy and count operations for efficient frequency analysis.
  6. Visual Processing Pipeline: Converting filtered Spark outputs to Pandas for high-fidelity rendering with Seaborn and Matplotlib.

πŸ’Ž Project Characteristics

This analysis stands out for its specific implementation traits:

  • Seamless Framework Interop: Demonstrates how to switch between Pandas and PySpark to get the "best of both worlds" (Easy plotting + Scalable processing).
  • Statistical Accuracy: Uses Spark's describe() to compute count, mean, standard deviation, min, and max over numeric columns.
  • Distribution Analysis: Focuses on Age density and Class vs Survival correlations as primary indicators.
  • Efficiency-First Design: Processes raw URL data directly without the need for manual CSV downloads.

πŸ“Š Analysis Highlights

The notebook covers the following analytical steps:

  1. Environment Setup: Automated PySpark installation and library configuration.
  2. Data Acquisition: Dynamic loading of the Titanic CSV dataset.
  3. Spark Operations:
    • Schema inspection and data type validation.
    • Descriptive statistics (mean, stddev, min, max).
    • Categorical grouping (Survival count, Pclass distribution).
  4. Visual Analysis:
    • Survival overview.
    • Age distribution with Kernel Density Estimate (KDE).
    • Bivariate analysis of Passenger Class vs. Survival.

βš™οΈ How to Run

  1. Prerequisites: Ensure you have Java 8 or higher installed (required for Spark).
  2. Installation:

     pip install pyspark pandas seaborn matplotlib

  3. Execution: Open the notebook in your preferred environment:

     jupyter notebook neha_exp9.ipynb

πŸ“‚ Project Structure

  • README.md: Project documentation and guide.
  • neha_exp9.ipynb: Jupyter notebook containing the PySpark EDA and visualizations.

✨ Author

Created with ❀️ by Neha. If you find this project useful, please feel free to star the repository!

About

This project demonstrates data loading, transformation, and analysis using distributed computing. It includes basic statistical exploration, grouping, and visualization (using Pandas, Matplotlib, and Seaborn) to uncover patterns in passenger survival based on features like age, class, and gender.
