startupskateboard/dataprocessing

Excel Data Pipeline

The System Design Document can be found in DESIGN.md.

A robust data ingestion pipeline that reads Excel files and prepares their contents for loading into a SQL database.

Features

  • Dynamic schema inference
  • Data type validation and normalization
  • Complex date format handling
  • Nested header support
  • SQL-ready output
  • Comprehensive logging
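
To make the schema inference and date handling features more concrete, here is a minimal sketch of how a column's SQL type could be inferred from its values. It is not the code in src/pipeline.py: the names `DATE_FORMATS`, `parse_date`, `infer_sql_type`, and `infer_schema`, the list of date formats, and the SQL type names are all assumptions for illustration. It uses only the standard library, whereas the actual pipeline depends on pandas and python-dateutil.

```python
from datetime import date, datetime

# Hypothetical set of accepted date formats; the real pipeline may accept others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def parse_date(value):
    """Return a datetime if the value is a date or matches DATE_FORMATS, else None."""
    if isinstance(value, datetime):
        return value
    if isinstance(value, date):
        return datetime(value.year, value.month, value.day)
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def _converts(value, caster):
    """True if caster (int or float) accepts the value's string form."""
    try:
        caster(str(value))
        return True
    except ValueError:
        return False


def infer_sql_type(values):
    """Pick the narrowest type (assumed names) that fits every non-empty value."""
    non_empty = [v for v in values if v is not None and str(v).strip() != ""]
    if not non_empty:
        return "TEXT"
    if all(_converts(v, int) for v in non_empty):
        return "INTEGER"
    if all(_converts(v, float) for v in non_empty):
        return "REAL"
    if all(parse_date(v) is not None for v in non_empty):
        return "DATE"
    return "TEXT"


def infer_schema(header, rows):
    """Map each column name in header to an inferred SQL type."""
    columns = list(zip(*rows)) if rows else [()] * len(header)
    return {name: infer_sql_type(col) for name, col in zip(header, columns)}


if __name__ == "__main__":
    header = ["id", "price", "joined", "note"]
    rows = [[1, "9.5", "2024-01-05", "a"], [2, "10", "05/01/2024", None]]
    print(infer_schema(header, rows))
```

Trying the narrowest type first (integer, then float, then date) and falling back to text keeps the sketch conservative: one value that does not fit demotes the whole column to a wider type rather than causing a failure at load time.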

Dependencies

  • pandas >= 2.0.0
  • openpyxl >= 3.1.0
  • python-dateutil >= 2.8.2

Installation

  1. Create a virtual environment:
python3 -m venv venv --without-pip
source venv/bin/activate
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py
  2. Install dependencies:
pip install -r requirements.txt

Usage

  1. Place your Excel files in the data/input directory. A sample input file is already included.

  2. Run the pipeline:

python3 src/pipeline.py

The pipeline will:

  • Process all Excel files found in the input directory
  • Store processed files in data/processed
  • Generate output in data/output
  • Create logs in the logs directory
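
The directory flow listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of src/pipeline.py: `run_pipeline`, the `process_file` callback, the `.xlsx` glob pattern, the `.sql` output extension, and the `pipeline.log` file name are all hypothetical. Only the four directory names come from this README.

```python
import logging
import shutil
from pathlib import Path


def run_pipeline(base_dir, process_file):
    """Process every Excel file under data/input and return the names handled.

    process_file is a hypothetical callback that takes a Path and returns
    the SQL-ready text for that file.
    """
    base = Path(base_dir)
    input_dir = base / "data" / "input"
    processed_dir = base / "data" / "processed"
    output_dir = base / "data" / "output"
    log_dir = base / "logs"
    for directory in (input_dir, processed_dir, output_dir, log_dir):
        directory.mkdir(parents=True, exist_ok=True)

    # Log to a file in the logs directory (file name is an assumption).
    logger = logging.getLogger("pipeline_sketch")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_dir / "pipeline.log")
    logger.addHandler(handler)

    handled = []
    try:
        for source in sorted(input_dir.glob("*.xlsx")):
            target = output_dir / (source.stem + ".sql")
            try:
                target.write_text(process_file(source))
            except Exception:
                # Leave the failed file in data/input so it can be retried.
                logger.exception("Failed to process %s", source.name)
                continue
            # Move the source file only after its output was written.
            shutil.move(str(source), str(processed_dir / source.name))
            logger.info("Processed %s -> %s", source.name, target.name)
            handled.append(source.name)
    finally:
        logger.removeHandler(handler)
        handler.close()
    return handled
```

Moving a file into data/processed only after its output has been written means a failed run leaves the input in place, so rerunning the pipeline picks it up again.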

Thank you for your consideration of my application. It's a privilege and I am grateful for the opportunity.
