This project demonstrates a robust, modular, and reproducible end-to-end machine learning pipeline for text classification (spam detection), leveraging DVC for data and experiment versioning, and AWS S3 for scalable remote storage. The pipeline is designed for extensibility, transparency, and ease of collaboration.
- `src/`: Contains modular Python scripts for each pipeline stage:
  - `data_ingestion.py`: Downloads and splits raw data.
  - `data_preprocessing.py`: Cleans, encodes, and normalizes text data.
  - `feature_engineering.py`: Extracts features using TF-IDF vectorization.
  - `model_building.py`: Trains a RandomForest model.
  - `model_evaluation.py`: Evaluates the model and logs metrics.
- `experiments/`: Contains Jupyter notebooks and datasets for experimentation.
- `logs/`: Stores logs for each pipeline stage for traceability.
- `dvc.yaml`: Defines the DVC pipeline stages and their dependencies (a sketch follows this list).
- `params.yaml`: Centralized configuration for pipeline parameters.
- `dvclive/`: Stores experiment metrics and plots for tracking.
- `.dvc/config`: Configures DVC to use AWS S3 as the remote storage backend.
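For orientation, a single stage in `dvc.yaml` might look like the sketch below; the stage name, paths, and parameter key are illustrative assumptions, and the checked-in `dvc.yaml` remains the source of truth.

```yaml
stages:
  data_ingestion:
    cmd: python src/data_ingestion.py   # command DVC runs for this stage
    deps:
      - src/data_ingestion.py           # re-run the stage when the script changes
    params:
      - data_ingestion.test_size        # re-run when this params.yaml key changes
    outs:
      - data/raw                        # stage output tracked by DVC
```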
**Data Ingestion**

- Downloads the dataset and splits it into train/test sets.
- Parameters (in `params.yaml`): test size.
- Output: `data/raw/`
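A minimal sketch of this stage, assuming a CSV source; the URL, file names, and seed are illustrative rather than taken from `data_ingestion.py`:

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split

def ingest(url: str, test_size: float) -> None:
    """Download the raw dataset and persist train/test splits as CSVs."""
    df = pd.read_csv(url)  # hypothetical CSV source
    train, test = train_test_split(df, test_size=test_size, random_state=42)
    os.makedirs("data/raw", exist_ok=True)
    train.to_csv("data/raw/train.csv", index=False)
    test.to_csv("data/raw/test.csv", index=False)
```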
**Data Preprocessing**

- Cleans text, removes duplicates, encodes labels, and normalizes.
- Output: `data/interim/`
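A typical normalization routine for spam text, sketched with nltk; the exact cleaning steps in `data_preprocessing.py` may differ:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # one-time corpus download
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, drop stopwords, and stem each token."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(STEMMER.stem(tok) for tok in text.split() if tok not in STOP_WORDS)
```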
**Feature Engineering**

- Applies TF-IDF vectorization to text data.
- Parameter: `max_features`.
- Output: `data/processed/`
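The core of this stage is scikit-learn's `TfidfVectorizer`; a sketch assuming the splits arrive as lists of strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize(train_texts, test_texts, max_features: int):
    """Fit TF-IDF on the training split only, then transform both splits."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)  # transform only: no test-set leakage
    return X_train, X_test, vectorizer
```

Fitting the vectorizer on the training split alone keeps test-set vocabulary from leaking into the features.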
**Model Building**

- Trains a `RandomForestClassifier`.
- Parameters: `n_estimators`, `random_state`.
- Output: `models/model.pkl`
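A sketch of the training step; the pickle path matches the output above, while the function signature is an assumption:

```python
import os
import pickle

from sklearn.ensemble import RandomForestClassifier

def train(X_train, y_train, n_estimators: int, random_state: int) -> None:
    """Fit the forest and persist it where the evaluation stage expects it."""
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    clf.fit(X_train, y_train)
    os.makedirs("models", exist_ok=True)
    with open("models/model.pkl", "wb") as f:
        pickle.dump(clf, f)
```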
**Model Evaluation**

- Evaluates the model using accuracy, precision, recall, and AUC.
- Logs metrics with dvclive and saves them to `reports/metrics.json`.
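A sketch of how the metrics and dvclive logging might fit together; the metric names and report path follow the bullets above, while the rest (binary 0/1 labels, function shape) is assumed:

```python
import json

from dvclive import Live
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def evaluate(clf, X_test, y_test) -> None:
    """Compute test metrics, log them with dvclive, and write a JSON report."""
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive (spam) class
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
    }
    with Live() as live:  # writes metrics and plots under dvclive/
        for name, value in metrics.items():
            live.log_metric(name, value)
    with open("reports/metrics.json", "w") as f:
        json.dump(metrics, f, indent=4)
```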
- DVC is used to:
  - Track data, models, and metrics.
  - Define and reproduce pipeline stages.
  - Version control large files and datasets.
- AWS S3 is configured as the DVC remote (`s3://subhdvc`), enabling scalable, cloud-based storage and collaboration.
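If the remote ever needs to be re-registered (for example, on a fresh machine), the standard DVC commands are below; the remote name `myremote` is an arbitrary choice:

```bash
dvc remote add -d myremote s3://subhdvc   # register the bucket as the default remote
dvc remote list                           # verify the configuration
```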
- Programming Language: Python 3.x
- ML Libraries: scikit-learn, pandas, numpy, nltk
- Feature Extraction: TfidfVectorizer
- Model: RandomForestClassifier
- Experiment Tracking: dvclive
- Pipeline & Data Versioning: DVC
- Remote Storage: AWS S3
- Logging: Python logging module
- Jupyter Notebook: For experimentation and EDA
**Install dependencies**

You may need to create a `requirements.txt` if one is not present. Typical dependencies: scikit-learn, pandas, numpy, nltk, dvc[s3], dvclive, pyyaml, etc.
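Until a `requirements.txt` is committed, the typical dependencies above can be installed directly (quoting `dvc[s3]` so the shell does not expand the brackets):

```bash
pip install scikit-learn pandas numpy nltk "dvc[s3]" dvclive pyyaml
```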
**Configure AWS S3 credentials**

Ensure your AWS credentials are set up so that DVC can access S3.
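One common approach is environment variables (an `aws configure` profile works just as well); the values below are placeholders:

```bash
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
export AWS_DEFAULT_REGION=<your-bucket-region>
```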
**Reproduce the pipeline**

Run `dvc repro` to execute every stage defined in `dvc.yaml`; stages whose dependencies have not changed are skipped.
**Track experiments and metrics**

Metrics and plots are available in `dvclive/` and `reports/metrics.json`.
**Push/pull data and models**

```bash
dvc push   # Uploads data/models to S3
dvc pull   # Downloads data/models from S3
```
- Modify `params.yaml` to tune hyperparameters and pipeline settings (a sketch follows this list).
- Add new stages or scripts in `src/` as needed.
- Use DVC to add new data or models for versioning.
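For orientation, the parameters called out in the stage descriptions suggest a `params.yaml` shaped roughly like this; the key names and values are illustrative, so check the file itself:

```yaml
data_ingestion:
  test_size: 0.2        # fraction of the data held out for testing
feature_engineering:
  max_features: 3000    # vocabulary size cap for TF-IDF
model_building:
  n_estimators: 100     # number of trees in the forest
  random_state: 42      # seed for reproducible training
```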
- All pipeline steps are modular and logged for traceability.
- Data, models, and metrics are versioned for reproducibility.
- Remote storage (S3) enables team collaboration and scalability.
- Experiment tracking with dvclive for ML workflow transparency.
This project is licensed under the GPL-3.0 License.