Dataruns is a Python library for data extraction, transformation, and pipeline creation. It supports multiple data sources, including CSV, SQLite, and Excel files, and provides tools for building and applying data-processing pipelines with a comprehensive set of transforms.
- Data Extraction: Extract data from CSV, SQLite, and Excel files
- Pipeline Creation: Build and execute custom data processing pipelines
- Data Transformations: Comprehensive set of transforms, including:
  - Scaling (StandardScaler, MinMaxScaler)
  - Missing-value handling (DropNA, FillNA)
  - Column operations (SelectColumns, RenameColumns)
  - One-hot encoding for categorical variables
- Pipeline Composition: Chain multiple transforms and operations
- Flexible Integration: Works with pandas DataFrames and numpy arrays
- Experimental Features: Performance optimizations 😗
Install the latest release with pip:

```bash
pip install dataruns
```

or with uv:

```bash
uv pip install dataruns
```

Or, if you prefer to work with the latest development version, clone the repository and set up a virtual environment:
```bash
git clone -b dev https://github.com/DanielUgoAli/Dataruns.git
cd Dataruns

# Create a virtual environment with uv
uv venv

# Activate the virtual environment
# On Windows:
.venv\Scripts\activate
# On macOS/Linux:
# source .venv/bin/activate

# Install dependencies
uv sync
```

Alternatively, you can install in development mode:
```bash
# Install in editable mode with all dependencies
uv pip install -e .
```

Extract data from a CSV file:

```python
from dataruns.source import CSVSource

# Extract data from CSV
source = CSVSource(file_path='data.csv')
dataframe = source.extract_data()
print(dataframe.head())
```
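SQLite is also listed among the supported sources. A hypothetical sketch, assuming a `SQLiteSource` class analogous to `CSVSource` (the class name and its parameters are assumptions here; check `dataruns.source` for the actual API):

```python
from dataruns.source import SQLiteSource  # assumption: name inferred by analogy with CSVSource
# Assumed parameters: a database path and a query to run
source = SQLiteSource(db_path='data.db', query='SELECT * FROM measurements')
dataframe = source.extract_data()  # assumed to mirror CSVSource.extract_data()
print(dataframe.head())
```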
Build a simple pipeline:

```python
from dataruns.core.pipeline import Pipeline
import numpy as np

# Create a simple transformation function
def transform_function(data):
    return data * 2

# Build and use the pipeline (a plain Python list would be repeated by *2,
# so pass a numpy array for element-wise doubling)
pipeline = Pipeline(transform_function)
result = pipeline(np.array([1, 2, 3, 4, 5]))
print(result)  # [2 4 6 8 10]
```
Compose multiple transforms:

```python
from dataruns.core.transforms import StandardScaler, FillNA, TransformComposer
import pandas as pd

# Create sample data
data = pd.DataFrame({
    'feature1': [1, 2, None, 4, 5],
    'feature2': [10, 20, 30, 40, 50]
})

# Create a preprocessing pipeline
preprocessor = TransformComposer(
    FillNA(method='mean'),
    StandardScaler()
)

# Apply transformations
processed_data = preprocessor.fit_transform(data)
print(processed_data)
```
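The other transforms listed under Core Components compose the same way. A sketch assuming `SelectColumns` takes a list of column names and `DropNA` drops rows by default (both signatures are assumptions; the class names come from this README):

```python
from dataruns.core.transforms import SelectColumns, DropNA, TransformComposer
import pandas as pd

data = pd.DataFrame({
    'feature1': [1, None, 3],
    'feature2': [4.0, 5.0, 6.0],
    'label': ['a', 'b', 'c']
})

# Assumed signatures: SelectColumns(columns=[...]) and DropNA() with
# row-dropping defaults; check dataruns.core.transforms for the real ones.
cleaner = TransformComposer(
    SelectColumns(columns=['feature1', 'feature2']),
    DropNA()
)
print(cleaner.fit_transform(data))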
Transforms also integrate with the pipeline API:

```python
from dataruns.core.pipeline import Pipeline
from dataruns.core.transforms import create_preprocessing_pipeline

# Create preprocessing pipeline
preprocessing = create_preprocessing_pipeline(
    scale_method='minmax',
    handle_missing='fill'
)

# Function to apply preprocessing
def preprocess_data(data):
    numerical_data = data.select_dtypes(include=['number'])
    return preprocessing.fit_transform(numerical_data)

# Integrate with the dataruns pipeline
pipeline = Pipeline(preprocess_data)
result = pipeline(your_dataframe)
```
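The feature list mentions one-hot encoding, but no dedicated encoder class appears under Core Components, so one portable approach is a custom function wrapped in a Pipeline. A sketch using pandas' `get_dummies` (any dedicated dataruns encoder may differ):

```python
from dataruns.core.pipeline import Pipeline
import pandas as pd

# One-hot encode categorical (object-dtype) columns with pandas,
# wrapped as an ordinary pipeline step.
def one_hot_encode(data):
    categorical = data.select_dtypes(include=['object']).columns.tolist()
    return pd.get_dummies(data, columns=categorical)

pipeline = Pipeline(one_hot_encode)
df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'value': [1, 2, 3]})
print(pipeline(df))
```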
Comprehensive examples are available as Jupyter notebooks in the examples/ folder:

- basic_usage_examples.ipynb: Basic pipeline and data extraction examples
- transform_examples.ipynb: Data transformation examples with various scalers and preprocessors
- comprehensive_transform_examples.ipynb: Advanced examples showing complex pipelines and custom transforms
To run the examples, you can use Jupyter or open them directly in VS Code:
```bash
# Install Jupyter if not already installed
uv add jupyter

# Launch Jupyter Lab
jupyter lab examples/

# Or run a specific notebook
jupyter notebook examples/basic_usage_examples.ipynb
```

Alternatively, you can open the .ipynb files directly in VS Code with the Jupyter extension for an integrated experience.
Run the test suite to verify functionality:
```bash
# Test core pipeline functionality
python tests/test_core_functionality.py

# Test transform functionality
python tests/test_transforms.py
```

The project is laid out as follows:

```
dataruns/
├── src/dataruns/    # Source code
│   ├── core/        # Core pipeline and transform functionality
│   └── source/      # Data source extractors
├── examples/        # Usage examples and demos
├── tests/           # Test suite
└── README.md        # This file
```
Core pipeline classes (dataruns.core.pipeline):

- Pipeline: Main pipeline class for chaining functions
- Make_Pipeline: Builder class for constructing pipelines
- Function: Wrapper class for consistent function handling
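A minimal sketch of the builder, assuming Make_Pipeline collects functions and produces a Pipeline (the `add` and `build` method names here are assumptions, not confirmed API):

```python
from dataruns.core.pipeline import Make_Pipeline
import numpy as np

# Hypothetical usage: the method names below (add, build) are assumptions;
# check dataruns.core.pipeline for the real interface.
builder = Make_Pipeline()
builder.add(lambda x: x + 1)   # step 1: shift
builder.add(lambda x: x * 10)  # step 2: scale
pipeline = builder.build()

print(pipeline(np.array([1, 2, 3])))  # [20 30 40]
```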
Transforms (dataruns.core.transforms):

- Transform: Abstract base class for all transforms
- StandardScaler: Standardize features (mean=0, std=1)
- MinMaxScaler: Scale features to a specified range
- DropNA: Remove rows/columns with missing values
- FillNA: Fill missing values with a specified strategy
- SelectColumns: Select specific columns from a DataFrame
- TransformComposer: Chain multiple transforms together
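Since Transform is the abstract base class, custom transforms can presumably subclass it. A sketch assuming the interface mirrors the fit_transform usage shown earlier (the `fit`/`transform` method names are assumptions):

```python
from dataruns.core.transforms import Transform
import pandas as pd

# Hypothetical custom transform: the fit/transform method names are
# assumptions inferred from fit_transform; check the Transform base class.
class ClipValues(Transform):
    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    def fit(self, data):
        return self  # nothing to learn for a fixed clip range

    def transform(self, data):
        return data.clip(lower=self.lower, upper=self.upper)

print(ClipValues(0, 10).transform(pd.Series([-5, 3, 42])))  # 0, 3, 10
```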
Data sources (dataruns.source):

- CSVSource: Extract data from CSV files
- DataSource: Base class for data extractors
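With DataSource as the base class, a custom extractor could look like the sketch below, assuming subclasses only need to provide extract_data() as CSVSource does (the actual DataSource hooks are not confirmed):

```python
from dataruns.source import DataSource
import pandas as pd

# Hypothetical custom source: assumes the only required method is
# extract_data(), matching the CSVSource usage shown earlier.
class JSONSource(DataSource):
    def __init__(self, file_path):
        self.file_path = file_path

    def extract_data(self):
        return pd.read_json(self.file_path)

# Usage mirrors CSVSource:
# dataframe = JSONSource(file_path='data.json').extract_data()
```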
All development work should be done on the dev branch. Help is much appreciated! Please open a pull request to contribute. 😗
This project is licensed under the MIT License - see the LICENSE file for details.