A complete time series analysis pipeline built with Python, from messy data to business insights.
Portfolio project demonstrating data cleaning, forecasting, analysis, and business storytelling for data analyst/scientist positions.
Takes messy retail sales data spanning 3 years and transforms it into cleaned data, forecasts, and business insights. The project:
- Demonstrates a complete data science workflow
- Shows data cleaning and validation skills
- Compares multiple forecasting approaches
- Extracts actionable business insights
- Creates visualizations
# Install dependencies
pip install -r requirements.txt
# Generate sample data
python generate_data.py
# Run complete analysis (3-5 minutes)
python run_all.py
- 1,095 daily sales records (2021-2023)
- $1.38 million total revenue
- 69,927 customers
- 3 store locations, 4 product categories
- 22 missing values → Interpolated using time-series methods
- 3 duplicate records → Removed
- 5 outliers → Capped using IQR method
- Inconsistent store names → Standardized
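The IQR capping step listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `cap_outliers_iqr`, the multiplier `k = 1.5`, and the toy values are all assumptions.

```python
import pandas as pd


def cap_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy values (not the project data): one obvious spike among typical sales.
sales = pd.Series([100.0, 110.0, 105.0, 95.0, 120.0, 5000.0, 98.0, 102.0])
capped = cap_outliers_iqr(sales)
print(capped.tolist())
```

Capping keeps the row in the series (unlike dropping it), which preserves the daily frequency that time series models expect.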
| Model | MAE ($) | RMSE ($) | MAPE (%) |
|---|---|---|---|
| Moving Average | 187.31 | 214.69 | 14.28 |
| ARIMA | 196.18 | 223.36 | 14.97 |
Moving Average had the lowest error on all three metrics (MAE, RMSE, MAPE) for this dataset; lower is better.
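For reference, the three error metrics in the table can be computed as in the sketch below. The helper name `forecast_errors` and the toy numbers are illustrative assumptions; the project scripts may compute them differently.

```python
import numpy as np


def forecast_errors(actual, predicted):
    """Return MAE, RMSE, and MAPE (%) for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        # MAPE is undefined when an actual value is zero; this sketch assumes none are.
        "MAPE": np.mean(np.abs(err / actual)) * 100,
    }


# Toy values, not the project's hold-out data.
metrics = forecast_errors([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
print(metrics)
```

RMSE penalizes large misses more heavily than MAE, so a model can rank differently on the two; here both models rank the same way on every metric.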
Growth:
- 14.8% YoY growth (2022)
- 14.4% YoY growth (2023)
- Consistent upward trend
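Year-over-year growth figures like these come from comparing annual totals. A minimal sketch, assuming a daily pandas Series indexed by date (the function name `yoy_growth` and the toy series are hypothetical):

```python
import pandas as pd


def yoy_growth(daily: pd.Series) -> pd.Series:
    """Percent change in total sales from each calendar year to the next."""
    annual = daily.groupby(daily.index.year).sum()
    return annual.pct_change().dropna() * 100


# Toy series (not the project data): flat daily sales that step up 10% each year.
idx = pd.date_range("2021-01-01", "2023-12-31", freq="D")
daily = pd.Series(100.0, index=idx)
daily[idx.year == 2022] = 110.0
daily[idx.year == 2023] = 121.0

growth = yoy_growth(daily)
print(growth.round(1))
```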
Store Performance:
- Store_A: $537,027 (38.8% of sales)
- Store_B: $453,228 (32.8%)
- Store_C: $393,636 (28.4%)
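The percentages follow directly from the store totals, and the three stores add up to the $1.38 million total quoted earlier. A quick check using the figures above:

```python
# Store totals as reported in this README.
store_sales = {"Store_A": 537_027, "Store_B": 453_228, "Store_C": 393_636}

total = sum(store_sales.values())
shares = {store: round(100 * amount / total, 1) for store, amount in store_sales.items()}

print(total)   # 1383891
print(shares)  # {'Store_A': 38.8, 'Store_B': 32.8, 'Store_C': 28.4}
```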
Product Categories:
- Electronics: 30.7% of total sales
- Clothing: 26.5%
- Home & Garden: 23.8%
- Food & Beverage: 19.0%
Temporal Patterns:
- Best day: Sunday ($1,349 average)
- Best month: April (highest average sales)
- Weekend sales significantly higher than weekdays
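Patterns like "best day" and "best month" can be found by grouping daily sales on the calendar attributes of the date index. The sketch below assumes a daily pandas Series with a DatetimeIndex; the function name and the synthetic data are illustrative only.

```python
import pandas as pd


def best_day_and_month(daily: pd.Series):
    """Return the weekday name and month name with the highest average sales."""
    by_day = daily.groupby(daily.index.day_name()).mean()
    by_month = daily.groupby(daily.index.month_name()).mean()
    return by_day.idxmax(), by_month.idxmax()


# Synthetic year (not the project data): a Sunday uplift and an April uplift.
idx = pd.date_range("2023-01-01", "2023-12-31", freq="D")
values = 100.0 + 50.0 * (idx.dayofweek == 6) + 30.0 * (idx.month == 4)
toy = pd.Series(values, index=idx)

best_day, best_month = best_day_and_month(toy)
print(best_day, best_month)
```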
time-series/
├── scripts/
│ ├── 01_data_cleaning.py
│ ├── 02_modeling_forecasting.py
│ ├── 03_analysis_insights.py
│ └── 04_executive_summary.py
│
├── data/ # Generated outputs
│ ├── cleaned_sales_data.csv
│ ├── model_comparison.csv
│ ├── key_insights.txt
│ └── *.png # 15+ visualizations
│
├── run_all.py
├── generate_data.py
└── requirements.txt
01_data_cleaning.py cleans messy retail data:
- Identifies and fixes missing values (2% of data)
- Removes duplicate dates
- Standardizes store names
- Handles outliers using statistical methods
- Validates data quality
Outputs: cleaned_sales_data.csv + 2 visualizations
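A rough sketch of how the name standardization, duplicate removal, and interpolation steps might fit together is shown below. The column names (`date`, `store`, `sales`), the function name `clean_sales`, and the toy rows are assumptions made for illustration and may not match 01_data_cleaning.py.

```python
import numpy as np
import pandas as pd


def clean_sales(raw: pd.DataFrame) -> pd.DataFrame:
    """Standardize store names, drop duplicate dates, and fill gaps by time."""
    df = raw.copy()
    df["date"] = pd.to_datetime(df["date"])

    # " STORE-A " and "store a" both become "Store_A".
    df["store"] = (
        df["store"]
        .str.strip()
        .str.replace(r"[\s\-]+", "_", regex=True)
        .str.title()
    )

    # Keep the first record for each date, then index by date.
    df = df.drop_duplicates(subset="date", keep="first")
    df = df.sort_values("date").set_index("date")

    # Time-weighted linear interpolation; requires a DatetimeIndex.
    df["sales"] = df["sales"].interpolate(method="time")
    return df


# Toy rows (hypothetical): one duplicate date, one missing value, messy names.
raw = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-02", "2021-01-03", "2021-01-04"],
    "store": ["Store_A", "store a", "store a", " STORE-A ", "Store_A"],
    "sales": [100.0, 110.0, 110.0, np.nan, 130.0],
})
cleaned = clean_sales(raw)
print(cleaned)
```

Standardizing names before removing duplicates matters when the duplicate check also involves the store column, since differently spelled names would otherwise hide duplicates.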
02_modeling_forecasting.py builds forecasting models:
- Decomposes time series (trend, seasonality, residuals)
- Tests for stationarity (Augmented Dickey-Fuller)
- Trains Moving Average and ARIMA models
- Compares performance using MAE, RMSE, MAPE
Outputs: model_comparison.csv + 4 visualizations
03_analysis_insights.py extracts business insights:
- Calculates KPIs (revenue, transactions, customer metrics)
- Compares store performance across 3 locations
- Analyzes 4 product categories
- Identifies weekly and monthly patterns
- Performs statistical tests (t-tests for significance)
Outputs: key_insights.txt + 7 visualizations
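One way a significance test of this kind could look is a two-sample t-test on weekend versus weekday sales. This is a sketch under assumptions: the function name, the choice of Welch's variant (`equal_var=False`), and the synthetic data are illustrative, and the project may test different groups.

```python
import numpy as np
import pandas as pd
from scipy import stats


def weekend_vs_weekday_ttest(daily: pd.Series):
    """Return (mean difference, p-value) for weekend vs weekday sales."""
    is_weekend = daily.index.dayofweek >= 5  # Saturday = 5, Sunday = 6
    weekend, weekday = daily[is_weekend], daily[~is_weekend]
    result = stats.ttest_ind(weekend, weekday, equal_var=False)
    return weekend.mean() - weekday.mean(), result.pvalue


# Synthetic data (not the project data): weekends get a built-in uplift.
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-02", periods=364, freq="D")
values = 1000 + 200 * (idx.dayofweek >= 5) + rng.normal(0, 50, len(idx))
diff, p = weekend_vs_weekday_ttest(pd.Series(values, index=idx))
print(f"difference = {diff:.1f}, p-value = {p:.3g}")
```

Welch's version does not assume equal variances in the two groups, which is a cautious default when weekend and weekday sales may differ in spread.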
04_executive_summary.py creates the executive report:
- Business KPI dashboard
- Key findings summary
- Model performance visualization
- Strategic recommendations
Outputs: executive_summary_metrics.csv + 3 visualizations
- Python: pandas, numpy, matplotlib, seaborn, scipy
- Time Series: statsmodels, ARIMA, decomposition
- Statistics: hypothesis testing, correlation analysis, outlier detection
- Data Cleaning: handling missing values, duplicates, outliers
- Data Visualization: 15+ visualizations
Raw Data → Clean → Explore → Model → Analyze → Report
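A runner such as run_all.py could chain the four stages in order, stopping at the first failure. The sketch below is a guess at that shape, not the actual file: `build_commands` and `run_pipeline` are hypothetical names, and only the script file names come from the project layout.

```python
import subprocess
import sys
from pathlib import Path

# Stage order follows the numbered scripts in scripts/.
SCRIPTS = [
    "01_data_cleaning.py",
    "02_modeling_forecasting.py",
    "03_analysis_insights.py",
    "04_executive_summary.py",
]


def build_commands(script_dir: str = "scripts"):
    """Return one interpreter command per pipeline stage, in order."""
    return [[sys.executable, str(Path(script_dir) / name)] for name in SCRIPTS]


def run_pipeline(script_dir: str = "scripts") -> None:
    """Run each stage in turn; check=True raises on the first failing stage."""
    for cmd in build_commands(script_dir):
        print("Running", cmd[1])
        subprocess.run(cmd, check=True)


# Preview the commands without executing them.
commands = build_commands()
for cmd in commands:
    print(" ".join(cmd))
```

Using `sys.executable` keeps every stage on the same interpreter and virtual environment as the runner itself.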
- Data quality assessment
- Systematic cleaning and validation
- Statistical analysis
- Predictive modeling
- Business insight extraction
- Executive communication
Possible future enhancements:
- Interactive dashboard (Streamlit)
- Additional models (XGBoost, LSTM)
- A/B testing framework
- Anomaly detection
- REST API for predictions
Created: January 2026
Last Updated: January 2026









