Welcome to the Titanic EDA project! This repository contains a comprehensive Exploratory Data Analysis (EDA) of the famous Titanic dataset, leveraging the power of Apache Spark (PySpark) and Python's visualization libraries.
This notebook demonstrates a hybrid approach to data science, combining the scalability of PySpark for data processing with the rich visualization capabilities of Pandas, Seaborn, and Matplotlib.
- Data Integration: Seamlessly transition between Pandas and Spark DataFrames.
- Big Data Ready: Utilizes Spark SQL for efficient data aggregation and statistical summaries.
- Interactive EDA: Insights into survival rates, demographics, and class distributions.
- Visual Storytelling: High-quality plots that highlight critical trends in the Titanic disaster.
- Language: Python 3
- Data Processing: PySpark, Pandas
- Visualization: Seaborn, Matplotlib
- Environment: Jupyter Notebook / Google Colab
The analysis follows a structured pipeline tailored for Big Data Exploration:
- Infrastructure Setup: Installing and configuring the PySpark environment within an IPython kernel.
- Spark Session Initialization: Building the entry point for Spark SQL and DataFrame API.
- Cross-Library Data Loading: Fetching the dataset via Pandas (for quick HTTP handling) and migrating it to the Spark SQL engine.
- Schema & Statistical Discovery: Deep-dive into data types, null counts, and distribution metrics (Mean, Standard Deviation).
- Multi-Dimensional Aggregation: Leveraging Spark's `groupBy` and `count` operations for efficient frequency analysis.
- Visual Processing Pipeline: Converting filtered Spark outputs to Pandas for high-fidelity rendering with Seaborn and Matplotlib.
This analysis stands out for its specific implementation traits:
- Seamless Framework Interop: Demonstrates how to switch between Pandas and PySpark to get the "best of both worlds" (Easy plotting + Scalable processing).
- Statistical Accuracy: Uses Spark's `describe()` to handle numerical data with high precision.
- Distribution Analysis: Focuses on Age density and Class vs. Survival correlations as primary indicators.
- Efficiency-First Design: Processes raw URL data directly without the need for manual CSV downloads.
The notebook covers the following analytical steps:
- Environment Setup: Automated PySpark installation and library configuration.
- Data Acquisition: Dynamic loading of the Titanic CSV dataset.
- Spark Operations:
- Schema inspection and data type validation.
- Descriptive statistics (mean, stddev, min, max).
- Categorical grouping (Survival count, Pclass distribution).
- Visual Analysis:
- Survival overview.
- Age distribution with Kernel Density Estimate (KDE).
- Bivariate analysis of Passenger Class vs. Survival.
- Prerequisites: Ensure you have Java 8 or higher installed (required for Spark).
- Installation:

  ```bash
  pip install pyspark pandas seaborn matplotlib
  ```

- Execution: Open the notebook in your preferred environment:

  ```bash
  jupyter notebook neha_exp9.ipynb
  ```
README.md: Project documentation and guide.
Created with ❤️ by Neha. If you find this project useful, please feel free to star the repository!