- This project analyzes Uber ride request data to uncover demand patterns and driver behavior using Python.
- It includes cleaning, feature engineering, visualizations, and clustering of pickup locations with K-Means.
- 📌 Description
- 🔄 Project Workflow
- 💡 Key Insights
- 🛠️ Recommendations
- 🗂️ Dataset
- ⚙️ Installation
- 🚀 Run the Full Pipeline
- 📽️ Inference Demo
- 📁 Project Structure
- 👥 Contributors
- 📝 License
A comprehensive end-to-end data science project analyzing real-world Uber ride request data using Python.
- Data Cleaning:
  - Parsed and standardized datetime columns (`Request timestamp`, `Drop timestamp`).
  - Handled missing values in `Driver id` and `Drop timestamp` without dropping large portions of data.
  - Cleaned categorical fields like `Status` and `Pickup point` (e.g., stripping whitespace, title casing).
- Feature Engineering:
  - Extracted informative time-based features:
    - `Request hour`, `Request day`, and `Request Date`
    - `Trip duration` (as timedelta and in minutes)
    - `Time slot` classification (Morning, Afternoon, Evening, Night)
    - Boolean flags like `Driver Available` and `Is Completed`
- Data Visualization:
  - Identified:
    - Demand peaks (morning and evening)
    - Driver inactivity periods
    - Request trends by hour, day, and location
    - Cancellation patterns and completion rates
- Clustering with K-Means (k=5):
  - Simulated pickup coordinates and applied clustering to segment NYC into 5 operational zones.
  - Visualized high-density clusters vs. low-demand zones.
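
The clustering step can be sketched roughly as follows. This is a minimal illustration and not the project's actual `location_clustering.py`: the function names `simulate_pickups` and `cluster_pickups`, the bounding box, and the random seed are assumptions made for the example. Only the use of K-Means with k=5 on simulated coordinates comes from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans


def simulate_pickups(n, seed=42):
    """Draw n random (lat, lon) points inside a rough, assumed NYC bounding box."""
    rng = np.random.default_rng(seed)
    lat = rng.uniform(40.55, 40.90, n)    # assumed latitude range
    lon = rng.uniform(-74.05, -73.75, n)  # assumed longitude range
    return np.column_stack([lat, lon])


def cluster_pickups(coords, k=5, seed=42):
    """Fit K-Means and return one zone label per point plus the zone centers."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = model.fit_predict(coords)
    return labels, model.cluster_centers_


coords = simulate_pickups(500)
labels, centers = cluster_pickups(coords)
print(np.bincount(labels))  # number of simulated pickups per zone
```

K-Means here treats latitude and longitude as plain Euclidean coordinates, which is a reasonable approximation over a city-sized area but would need a projection or haversine distance over larger regions.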
- Peak Demand: Evening hours (5–9 PM) have the highest request volume but the lowest driver availability.
- Driver Inactivity: Fridays show high driver inactivity despite high demand.
- Trip Duration: Early morning trips (12–4 AM) are slightly longer than trips at other times of day.
- Top Drivers: Most of the top 10 drivers completed 20–22 trips each, a tight distribution.
- Pickup Patterns: The majority of requests originate from the City; the Airport has a high rate of "No Cars Available".
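
A comparison such as the peak-demand insight can be reproduced with a simple group-by. The sketch below assumes a DataFrame that already has the `Request hour` and `Status` columns described in this README. The function name `hourly_supply_gap` and the output column names are hypothetical.

```python
import pandas as pd


def hourly_supply_gap(df):
    """Summarise request volume and outcome rates for each request hour."""
    flags = df.assign(
        no_car=df["Status"].eq("No Cars Available"),
        completed=df["Status"].eq("Trip Completed"),
    )
    summary = flags.groupby("Request hour").agg(
        requests=("Status", "size"),       # total requests in that hour
        no_car_rate=("no_car", "mean"),    # share with no car available
        completion_rate=("completed", "mean"),
    )
    # Busiest hours first, so demand peaks appear at the top.
    return summary.sort_values("requests", ascending=False)
```

Hours with both a high `requests` count and a high `no_car_rate` are the ones where demand outstrips driver supply.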
- Incentivize drivers during evening peaks and on inactive days (e.g., Friday).
- Promote early morning hours (12–3 AM) with offers and safety campaigns.
- Use the K-Means zones to split NYC operationally for optimized driver distribution.
- Coordinate with airports to improve pickup logistics, and reduce cancellations in the City.
- File Name: `Uber Request Data.csv`
- Source: Provided by course instructors
- Format: Tab-separated values (`.tsv`)
- Rows: 6,745 Uber ride requests
- Description: Raw log of Uber rides captured over a period of several days. Used for analyzing ride demand, driver activity, cancellations, and trip patterns.
- `Request id`: Unique numeric ID for each ride request
- `Pickup point`: Location where the ride was requested from (`City` or `Airport`)
- `Driver id`: ID of the driver assigned (may be missing if unassigned)
- `Status`: Trip outcome (`Trip Completed`, `Cancelled`, or `No Cars Available`)
- `Request timestamp`: Timestamp when the user requested the ride
- `Drop timestamp`: Timestamp when the trip was completed (may be null)
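
Loading a file with these columns might look like the sketch below. Since the file carries a `.csv` name while the format is listed as tab-separated, the example lets pandas detect the delimiter (`sep=None` with the Python parsing engine). The function name `load_requests` and the column check are assumptions for illustration, not the project's actual loader.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "Request id",
    "Pickup point",
    "Driver id",
    "Status",
    "Request timestamp",
    "Drop timestamp",
]


def load_requests(source):
    """Read the raw request log and verify that the documented columns exist."""
    # sep=None asks the Python engine to sniff the delimiter from the first row.
    df = pd.read_csv(source, sep=None, engine="python")
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return df
```

For example, `df = load_requests("data/Uber Request Data.csv")` followed by `df.isna().sum()` gives a quick count of missing `Driver id` and `Drop timestamp` values.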
To get started, make sure you have Python 3.10+ installed on your system.
Then install all required libraries by running:

    pip install -r requirements.txt

The required libraries are:

- `pandas` → data manipulation
- `numpy` → numerical operations
- `matplotlib`, `seaborn` → visualizations
- `scikit-learn` → KMeans clustering
- `python-dateutil` → parsing flexible timestamps
Ensure the following are in place before running the project:
- `data/Uber Request Data.csv` is inside the `data/` folder
- `src/` contains all modular code files
- `output/` folder exists for saving processed results
- `main.py` exists in the project root
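
This checklist can also be verified with a small script run from the project root. It is only a convenience sketch using the standard library. The helper name `missing_prerequisites` is hypothetical and not part of the project.

```python
from pathlib import Path

# Paths named in the checklist above, relative to the project root.
REQUIRED_PATHS = (
    "data/Uber Request Data.csv",
    "src",
    "output",
    "main.py",
)


def missing_prerequisites(root="."):
    """Return the required paths that do not exist under root."""
    root = Path(root)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```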
Run the entire analysis and visualization pipeline with:

    python main.py

Once the project is launched via `main.py`, the following steps are performed:
- Loads and parses the raw Uber request dataset
- Cleans and preprocesses timestamps, missing values, and categories
- Extracts useful features like time slot, duration, and weekday
- Adds simulated geo-coordinates and clusters pickup points using K-Means
- Displays visual insights (trip patterns, demand peaks, hotspot regions)
- Saves the enriched dataset to `output/Uber with features.csv`
Visualizations are displayed live using matplotlib and seaborn.
The final processed dataset can be reused for further analysis.
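
The cleaning and feature-extraction steps of the pipeline could be sketched as below. This is an illustration under stated assumptions, not the code in `src/`: the function names, the day-first date interpretation (`dayfirst=True`), the hour boundaries of each time slot, and the `Trip duration (min)` column name are all assumed. The remaining column names follow this README.

```python
import pandas as pd
from dateutil import parser


def parse_timestamp(value):
    """Parse one timestamp string; missing values become NaT."""
    if pd.isna(value):
        return pd.NaT
    # Assumption: dates in the raw file are written day-first.
    return parser.parse(str(value), dayfirst=True)


def time_slot(hour):
    """Map an hour (0-23) to a coarse slot; the boundaries are assumed."""
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    return "Night"


def clean_and_engineer(df):
    """Return a cleaned copy of the raw data with derived feature columns."""
    df = df.copy()
    # Standardize both timestamp columns, keeping missing drops as NaT.
    for col in ("Request timestamp", "Drop timestamp"):
        df[col] = pd.to_datetime(df[col].apply(parse_timestamp))
    # Normalize categorical text: strip whitespace, apply title case.
    for col in ("Status", "Pickup point"):
        df[col] = df[col].str.strip().str.title()
    df["Request hour"] = df["Request timestamp"].dt.hour
    df["Request day"] = df["Request timestamp"].dt.day_name()
    df["Request Date"] = df["Request timestamp"].dt.date
    df["Trip duration"] = df["Drop timestamp"] - df["Request timestamp"]
    df["Trip duration (min)"] = df["Trip duration"].dt.total_seconds() / 60
    df["Time slot"] = df["Request hour"].apply(time_slot)
    # Rows are flagged rather than dropped when values are missing.
    df["Driver Available"] = df["Driver id"].notna()
    df["Is Completed"] = df["Status"].eq("Trip Completed")
    return df
```

Flagging missing `Driver id` values instead of dropping those rows keeps unassigned and cancelled requests in the data, which the demand and cancellation analysis depends on.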
The project is organized in a modular, production-ready structure:
Uber-Trip-Analysis/
│
├── data/ # Raw input data (Uber Request Data.csv)
│ └── Uber Request Data.csv
│
├── output/ # Final enriched data & visual assets
│ └── Uber with features.csv
│
├── src/ # All modular processing code
│ ├── __init__.py
│ ├── data_loading_and_exploration.py
│ ├── data_cleaning_and_preprocessing.py
│ ├── data_feature_engineering.py
│ ├── data_visualization.py
│ ├── location_clustering.py
│ └── save_transformed_data.py
│
├── main.py # Master pipeline to run the entire workflow
├── requirements.txt # Python dependencies
└── README.md # Project documentation
- Malak Salem: Data loading, exploration, and reporting key patterns.
- Laila Shawky: Cleaning and feature engineering (timestamps, nulls, derived fields).
- Jumanah Rushdi: Visualizations and spatial clustering (K-Means on pickup points).
- Open source under the MIT License.
- Free to use, modify, and distribute with attribution.
