This pipeline converts Excel files with varying structures into SQL-ready data, with an emphasis on handling the quirks of real-world spreadsheets.
The system is divided into four Python modules, each handling a specific aspect of the pipeline:
The main entry point (pipeline.py), which:
- Creates and manages the directory structure (input, processed, output)
- Sets up logging with timestamps for traceability
- Iterates through Excel files and coordinates their processing
- Implements basic error handling so that one file's failure doesn't stop the entire pipeline (see the sketch below)
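
A minimal sketch of what this orchestration might look like, assuming the directory names above; `process_workbook()` is a hypothetical stand-in for the processing module's entry point (fleshed out further below), not the actual implementation:

```python
import logging
from pathlib import Path

INPUT_DIR = Path("input")
PROCESSED_DIR = Path("processed")
OUTPUT_DIR = Path("output")


def process_workbook(path: Path) -> None:
    """Placeholder for the Excel processing module's entry point."""
    raise NotImplementedError


def main() -> None:
    # Create the directory structure if it does not already exist.
    for directory in (INPUT_DIR, PROCESSED_DIR, OUTPUT_DIR):
        directory.mkdir(exist_ok=True)

    # Timestamped log records give each run a traceable history.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )

    # Process every workbook; a failure in one file is logged
    # and skipped so it cannot halt the whole pipeline.
    for excel_file in sorted(INPUT_DIR.glob("*.xls*")):
        try:
            process_workbook(excel_file)
        except Exception:
            logging.exception("Failed to process %s", excel_file.name)


if __name__ == "__main__":
    main()
```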
The Excel processing module, which handles the core extraction logic:
- Copies input files to the processed directory so the originals are preserved
- Processes each sheet in the Excel file independently
- Implements multi-row header detection using a simple heuristic (one possible version is sketched after this list)
- Transforms headers into SQL-compatible column names
- Outputs processed data as CSV files
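
The document doesn't spell out the heuristic, so the sketch below shows one plausible version: leading rows whose populated cells are mostly text are treated as headers and joined into single column names. The names `detect_header_rows` and `read_sheet` are illustrative, not the module's real API:

```python
import pandas as pd


def detect_header_rows(raw: pd.DataFrame, max_rows: int = 3) -> int:
    """Count leading rows whose populated cells are mostly text labels."""
    count = 0
    for i in range(min(max_rows, len(raw))):
        cells = raw.iloc[i].dropna()
        if cells.empty:
            break
        # A row that is mostly non-string values is probably data, not labels.
        if sum(isinstance(v, str) for v in cells) / len(cells) < 0.8:
            break
        count += 1
    return max(count, 1)


def read_sheet(path: str, sheet: str) -> pd.DataFrame:
    # header=None so the heuristic sees the raw grid, headers included.
    raw = pd.read_excel(path, sheet_name=sheet, header=None)
    n = detect_header_rows(raw)
    # Join stacked header rows into one underscore-separated name per column.
    names = raw.iloc[:n].apply(
        lambda col: "_".join(str(v) for v in col if pd.notna(v)), axis=0
    )
    data = raw.iloc[n:].reset_index(drop=True)
    data.columns = names
    return data
```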
The type-handling module, which focuses on data type detection and normalization (sketched after this list):
- Attempts numeric conversion first (integers, then floats)
- Handles date parsing with multiple format attempts
- Falls back to string type when other conversions fail
- Maintains consistent types within columns
- Uses pandas' built-in type inference capabilities
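
A hedged sketch of that conversion order, integers, then floats, then dates, then a string fallback; the specific date formats and the `normalize_column` name are assumptions for illustration:

```python
import pandas as pd


def normalize_column(col: pd.Series) -> pd.Series:
    """Coerce a column to one consistent type, preferring richer types."""
    # Numeric first: if every non-null value survives coercion, keep it.
    numeric = pd.to_numeric(col, errors="coerce")
    if numeric.notna().sum() == col.notna().sum():
        if (numeric.dropna() % 1 == 0).all():
            return numeric.astype("Int64")  # pandas' nullable integer dtype
        return numeric.astype(float)

    # Dates next, trying a handful of common formats.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"):
        dates = pd.to_datetime(col, format=fmt, errors="coerce")
        if dates.notna().sum() == col.notna().sum():
            return dates

    # Fall back to strings so the column stays internally consistent.
    return col.astype(str)
```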
The shared utilities module, which provides common functionality:
- Configures logging to both file and console
- Normalizes column names to be SQL-compatible (see the sketch below)
- Handles special character replacement in column names
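
A minimal sketch of that normalization, assuming the usual rules (lowercase, runs of special characters collapsed to underscores, a prefix for names that start with a digit); the function name is illustrative:

```python
import re


def normalize_column_name(name: str) -> str:
    """Turn an arbitrary header into a safe SQL identifier."""
    # Collapse runs of spaces, punctuation, etc. into single underscores.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()
    if not cleaned:
        cleaned = "unnamed"
    # SQL identifiers cannot start with a digit.
    if cleaned[0].isdigit():
        cleaned = f"col_{cleaned}"
    return cleaned
```

Under these rules, `Total Sales ($)` becomes `total_sales` and `2023 Revenue` becomes `col_2023_revenue`.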
The pipeline follows a simple, linear flow:
- Excel files are detected in the input directory
- Files are copied to a processed directory
- Each sheet is processed independently
- Normalized data is saved as CSV files in the output directory
Error handling is built in at several levels (see the sketch below):
- Sheet-level isolation, so errors in one sheet don't affect others
- Timestamped logging of errors
- Processing continues despite individual failures
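
A fuller sketch of the per-sheet isolation, filling in the `process_workbook()` placeholder from earlier; the CSV naming scheme is an assumption:

```python
import logging
from pathlib import Path

import pandas as pd


def process_workbook(path: Path, output_dir: Path = Path("output")) -> None:
    """Process each sheet independently; one bad sheet never blocks the rest."""
    workbook = pd.ExcelFile(path)
    for sheet in workbook.sheet_names:
        try:
            df = pd.read_excel(workbook, sheet_name=sheet)
            out = output_dir / f"{path.stem}_{sheet}.csv"
            df.to_csv(out, index=False)
            logging.info("Wrote %s", out)
        except Exception:
            # Timestamped via the logging config; processing continues.
            logging.exception("Sheet %r in %s failed; continuing", sheet, path.name)
```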
The current implementation has several constraints:
- Processes files sequentially, not in parallel
- Holds entire sheets in memory
- No automated tests
- Basic data type inference
The system is designed for straightforward usage:
- Place Excel files in the input directory
- Run pipeline.py
- Retrieve processed CSV files from the output directory
- Check logs for any processing issues