Welcome to the Titanic EDA project! This repository contains a comprehensive Exploratory Data Analysis (EDA) of the famous Titanic dataset, leveraging the power of Apache Spark (PySpark) and Python's visualization libraries.
This notebook demonstrates a hybrid approach to data science, combining the scalability of PySpark for data processing with the rich visualization capabilities of Pandas, Seaborn, and Matplotlib.
- Data Integration: Seamlessly transition between Pandas and Spark DataFrames.
- Big Data Ready: Utilizes Spark SQL for efficient data aggregation and statistical summaries.
- Interactive EDA: Insights into survival rates, demographics, and class distributions.
- Visual Storytelling: High-quality plots that highlight critical trends in the Titanic disaster.
- Language: Python 3
- Data Processing: PySpark, Pandas
- Visualization: Seaborn, Matplotlib
- Environment: Jupyter Notebook / Google Colab
The analysis follows a structured pipeline tailored for Big Data Exploration:
- Infrastructure Setup: Installing and configuring the PySpark environment within an IPython kernel.
- Spark Session Initialization: Building the entry point for Spark SQL and DataFrame API.
- Cross-Library Data Loading: Fetching the dataset via Pandas (for quick HTTP handling) and migrating it to the Spark SQL engine.
- Schema & Statistical Discovery: Deep-dive into data types, null counts, and distribution metrics (Mean, Standard Deviation).
- Multi-Dimensional Aggregation: Leveraging Spark's `groupBy` and `count` operations for efficient frequency analysis.
- Visual Processing Pipeline: Converting filtered Spark outputs to Pandas for high-fidelity rendering with Seaborn and Matplotlib.
This analysis stands out for its specific implementation traits:
- Seamless Framework Interop: Demonstrates how to switch between Pandas and PySpark to get the "best of both worlds" (Easy plotting + Scalable processing).
- Statistical Accuracy: Uses Spark's `describe()` to handle numerical data with high precision.
- Distribution Analysis: Focuses on Age density and Class vs. Survival correlations as primary indicators.
- Efficiency-First Design: Processes raw URL data directly without the need for manual CSV downloads.
The notebook covers the following analytical steps:
- Environment Setup: Automated PySpark installation and library configuration.
- Data Acquisition: Dynamic loading of the Titanic CSV dataset.
- Spark Operations:
- Schema inspection and data type validation.
- Descriptive statistics (mean, stddev, min, max).
- Categorical grouping (Survival count, Pclass distribution).
- Visual Analysis:
- Survival overview.
- Age distribution with Kernel Density Estimate (KDE).
- Bivariate analysis of Passenger Class vs. Survival.
- Prerequisites: Ensure you have Java 8 or higher installed (required for Spark).
- Installation:

  ```bash
  pip install pyspark pandas seaborn matplotlib
  ```

- Execution: Open the notebook in your preferred environment:

  ```bash
  jupyter notebook neha_exp9.ipynb
  ```
README.md: Project documentation and guide.
Created with ❤️ by Neha. If you find this project useful, please feel free to star the repository!