Project Objective:
Build a professional‑grade, end‑to‑end MLOps platform targeted at vehicle insurance risk prediction. This project is designed not just to produce accurate models, but to demonstrate the ability to deliver ML as a repeatable, scalable, and maintainable software product. Key goals include:
- Ingest and curate real user data from a live MongoDB Atlas cluster.
- Validate data against a strict schema and measure quality metrics.
- Transform raw features into model‑ready representations with reusable transformers.
- Train, evaluate, and compare models automatically; serialize both models and performance metadata.
- Promote only better‑performing models to a persistent registry (AWS S3), enabling version control.
- Provide an accessible prediction API via Flask with consistent preprocessing for serving.
- Automate the full pipeline (build, test, deploy) using Docker and GitHub Actions backed by a self‑hosted EC2 runner.
Ultimately, the objective is to simulate a real‑world ML engineering workflow, showcasing architectural hygiene, cloud integration, DevOps practices, and a strong emphasis on reproducibility and auditability.
This repository represents a complete software engineering lifecycle for a machine learning product—ideal for technical reviewers and hiring managers to assess architecture, quality, and deployment expertise.
The project follows a modular architecture with clear separation of concerns. Below is a high‑level tree highlighting key folders and files:
Vehicle-Insurance-Domain--MLOps-project/
├── app.py # Flask app exposing training/prediction endpoints
├── src/ # Python package containing all business logic
│ ├── components/ # Reusable pipeline components (ingestion, validation, …)
│ ├── configuration/ # Helpers for MongoDB & AWS connections
│ ├── constants/ # Global constants and environment variables
│ ├── data_access/ # Data layer specific to MongoDB
│ ├── entity/ # Dataclasses for configs/artifacts (DTOs)
│ ├── exception/ # Custom exception hierarchy for robustness
│ ├── logger/ # Logging configuration & utilities
│ ├── pipline/ # Orchestrators for training & prediction
│ └── utils/ # Miscellaneous helpers (schema, file I/O)
├── config/ # YAML definitions (dataset schema, parameters)
├── notebook/ # Jupyter notebooks for EDA and MongoDB demo
├── artifact/ # Output from pipeline runs (timestamped)
├── static/ templates/ # Web assets for the Flask UI
├── Dockerfile # Containerization instructions
├── .github/workflows/ # CI/CD workflow definitions (AWS deployment)
├── requirements.txt # Python dependencies
├── setup.py, pyproject.toml # Packaging configuration for editable install
└── project_flow.txt # Step‑by‑step development notes (useful read)
💡 Tip: The `src` package is installed locally via `pip install -e .`, enabling imports like `from src.components import DataIngestion` throughout the codebase.
The src directory houses all the business logic and is organized by responsibility. Each folder contains the code needed for a specific stage of the pipeline or utility.
- `components/` – Individual pipeline steps implemented as classes (`DataIngestion`, `DataValidation`, `DataTransformation`, `ModelTrainer`, `ModelEvaluation`, `ModelPusher`). Each component defines an `initiate_<step>` method and returns an artifact object.
- `configuration/` – Configuration helpers, such as `mongo_db_connection.py` and `aws_connection.py`. These encapsulate connection logic to external services and read from environment variables.
- `constants/` – Central location for constants like database names, bucket keys, thresholds, and file path templates. Updating values here propagates across components.
- `data_access/` – Data layer that fetches raw data from MongoDB and converts it to a pandas DataFrame. `proj1_data.py` encapsulates queries and transformation logic.
- `entity/` – Data Transfer Objects (DTOs) using `@dataclass`. Contains config classes (e.g., `DataIngestionConfig`) and artifact classes (e.g., `ModelTrainerArtifact`). An additional `s3_estimator.py` holds AWS S3 helper logic.
- `exception/` – Defines custom exception types (`InsuranceException`) and utilities for wrapping and logging errors consistently.
- `logger/` – Logging setup, providing a module‑level logger used by all other packages. Helps maintain uniform log formatting.
- `pipline/` – High‑level orchestrators: `training_pipeline.py` sequences the six components; `prediction_pipeline.py` handles the inference path.
- `utils/` – Miscellaneous helpers including schema validation, file I/O helpers, and common utilities used by multiple components.
Having this clean separation improves maintainability, enables unit testing of individual parts, and lets you swap or extend functionality (e.g., adding a new data source) without touching unrelated code.
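The component contract described above (a config goes in, an `initiate_<step>` method runs, an artifact comes out) can be sketched as follows. This is a minimal illustration and not the repository's code: the fields on `DataIngestionConfig` and `DataIngestionArtifact`, the `rows` argument, and the sample values are all assumptions.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataIngestionConfig:
    # Hypothetical fields; the real config class may differ.
    collection_name: str
    feature_store_path: Path


@dataclass(frozen=True)
class DataIngestionArtifact:
    # Hypothetical fields describing what the step produced.
    feature_store_path: Path
    row_count: int


class DataIngestion:
    """A pipeline step: takes a config, returns an artifact."""

    def __init__(self, config: DataIngestionConfig):
        self.config = config

    def initiate_data_ingestion(self, rows: list[dict]) -> DataIngestionArtifact:
        # Placeholder for the real work (query MongoDB, write a CSV).
        return DataIngestionArtifact(
            feature_store_path=self.config.feature_store_path,
            row_count=len(rows),
        )


config = DataIngestionConfig("vehicle_data", Path("artifact/feature_store/data.csv"))
artifact = DataIngestion(config).initiate_data_ingestion([{"Age": 44}, {"Age": 31}])
print(artifact)
```

Because each step only depends on the previous step's artifact type, a step can be unit tested by constructing that artifact by hand.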
Follow these steps to replicate the environment on your machine:
- Environment creation

      conda create -n vehicle python=3.10 -y
      conda activate vehicle
      pip install -r requirements.txt
      pip install -e .   # makes `src` available as a package

- Configuration
  - Populate `config/schema.yaml` with your dataset's feature definitions and accepted ranges.
  - Add secrets via environment variables (`MONGODB_URL`, `AWS_ACCESS_KEY_ID`, etc.).
  - Update `src/constants/__init__.py` if you change bucket names or thresholds.
- Verify local package

      python -c "import src; print(src)"

  Confirming that the project is importable catches packaging and import errors early.
- Run demonstrations
  - Execute `demo.py` to test logging, exception handling, and simple component runs.
  - Open notebooks in `notebook/` to explore the data and interact with MongoDB.
Data is stored and retrieved from a MongoDB Atlas cluster:
| Step | Description |
|---|---|
| 1 | Sign up for MongoDB Atlas; create a project and an M0 (free) cluster. |
| 2 | Add a database user and whitelist 0.0.0.0/0 for development. |
| 3 | Obtain the connection string, replace <password>, and set MONGODB_URL env variable. |
| 4 | Use notebook/mongoDB_demo.ipynb to upload sample data and inspect collections. |
The connection helper lives in src/configuration/mongo_db_connection.py and is used by src/data_access/proj1_data.py to pull documents and frame them as pandas.DataFrame objects.
Robust logging and exception handling ensure visibility when pipelines run:
- Logger: `src/logger/__init__.py` uses Python's `logging` module to create a reusable logger with timestamped, levelled messages.
- Custom Exception: `src/exception/insurance_exception.py` defines `InsuranceException`, which wraps underlying errors, preserving stack traces and context.
- Unit tests in `tests/` (if present) exercise both facilities.

⚠️ All `except` blocks re‑raise errors as `InsuranceException` to unify error handling across components.
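The wrap-and-re-raise pattern can be sketched with the standard library alone. The logger name, the log format, the message layout, and the `load_threshold` helper are assumptions made for illustration; the repository's `InsuranceException` may record different details.

```python
import logging
import sys

logging.basicConfig(
    format="[%(asctime)s] %(name)s %(levelname)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("vehicle_insurance")  # hypothetical logger name


class InsuranceException(Exception):
    """Wraps an error and records where it was raised."""

    def __init__(self, error: Exception):
        _, _, tb = sys.exc_info()
        # Walk to the innermost frame, i.e. where the error originated.
        while tb is not None and tb.tb_next is not None:
            tb = tb.tb_next
        if tb is not None:
            location = f"{tb.tb_frame.f_code.co_filename}:{tb.tb_lineno}"
        else:
            location = "unknown location"
        super().__init__(f"{type(error).__name__} at {location}: {error}")


def load_threshold(raw: str) -> float:
    """Hypothetical helper that fails on malformed input."""
    try:
        return float(raw)
    except Exception as err:
        logger.error("Could not parse threshold %r", raw)
        # `from err` keeps the original exception attached as __cause__.
        raise InsuranceException(err) from err
```

Callers then only need to catch one exception type, while the original error stays reachable through `__cause__`.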
Transforms raw MongoDB documents into a usable dataset:
- Constants: Declare field names, DB names, and file paths in `src/constants/__init__.py`.
- Configuration entity: `DataIngestionConfig` captures parameters such as database name and export path.
- Artifact entity: `DataIngestionArtifact` records output locations and status.
- Component: `src/components/data_ingestion.py` contains the `DataIngestion` class with an `initiate_data_ingestion()` method which:
  - Connects to MongoDB via the configuration helper.
  - Reads raw documents.
  - Converts them to a DataFrame and writes them to the feature store CSV.
- Pipeline integration: The ingestion step is invoked by `src/pipline/training_pipeline.py`, maintaining dependency order.
This modular design allows you to swap the data source (e.g., S3, PostgreSQL) by implementing a new component with the same interface.
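One way to read "the same interface" is a small protocol that any data source satisfies. The sketch below is hypothetical: `DataSource`, `InMemorySource`, `export_to_feature_store`, and the feature names are invented for illustration, and the standard `csv` module stands in for the DataFrame export.

```python
import csv
import tempfile
from pathlib import Path
from typing import Iterable, Protocol


class DataSource(Protocol):
    def fetch(self) -> Iterable[dict]: ...


class InMemorySource:
    """Stand-in for a MongoDB-, S3- or PostgreSQL-backed source."""

    def __init__(self, documents: list[dict]):
        self._documents = documents

    def fetch(self) -> Iterable[dict]:
        return iter(self._documents)


def export_to_feature_store(source: DataSource, csv_path: Path) -> int:
    """Write fetched documents to a CSV file and return the row count."""
    # Drop MongoDB's `_id` key if present; it is not a feature.
    rows = [{k: v for k, v in doc.items() if k != "_id"} for doc in source.fetch()]
    if not rows:
        raise ValueError("source returned no documents")
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Hypothetical documents; real field names come from the dataset.
docs = [
    {"_id": 1, "Age": 44, "Vehicle_Damage": "Yes"},
    {"_id": 2, "Age": 31, "Vehicle_Damage": "No"},
]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "feature_store" / "data.csv"
    written = export_to_feature_store(InMemorySource(docs), target)
    header = target.read_text().splitlines()[0]
```

Swapping the source then means writing another class with a `fetch()` method; the export function does not change.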
Ensures data quality before any downstream processing:
- Schema defined in `config/schema.yaml` includes type expectations, allowed categories, and numerical ranges.
- `src/utils/main_utils.py` houses helpers (`validate_schema`, `report_missing_values`, etc.).
- `src/components/data_validation.py` compares ingested data against the schema and generates a human‑readable report (`artifact/report.yaml`).
- Validation artifacts are propagated through `training_pipeline` for logging and decision‑making (e.g., aborting on critical errors).
Maintaining a separate validation component improves reproducibility and provides a clear audit trail when data drift occurs.
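A schema check of this kind can be sketched in plain Python. The schema layout (`type`, `range`, `allowed`), the report keys, and the feature names below are assumptions; the actual structure of `config/schema.yaml` and the repository's `validate_schema` helper may differ.

```python
def validate_rows(rows: list[dict], schema: dict) -> dict:
    """Return a report of schema violations for a list of records."""
    report = {
        "missing_columns": [],
        "type_errors": [],
        "range_errors": [],
        "category_errors": [],
    }
    for column, rule in schema.items():
        if any(column not in row for row in rows):
            report["missing_columns"].append(column)
            continue
        for index, row in enumerate(rows):
            value = row[column]
            if not isinstance(value, rule["type"]):
                report["type_errors"].append((index, column))
                continue
            if "range" in rule:
                low, high = rule["range"]
                if not low <= value <= high:
                    report["range_errors"].append((index, column))
            if "allowed" in rule and value not in rule["allowed"]:
                report["category_errors"].append((index, column))
    # Valid only if every error list is empty.
    report["is_valid"] = all(len(found) == 0 for found in report.values())
    return report


# Hypothetical schema, as it might look after loading a YAML file.
schema = {
    "Age": {"type": int, "range": (18, 100)},
    "Vehicle_Damage": {"type": str, "allowed": {"Yes", "No"}},
}
rows = [
    {"Age": 44, "Vehicle_Damage": "Yes"},
    {"Age": 17, "Vehicle_Damage": "Maybe"},
]
report = validate_rows(rows, schema)
```

Here the second record violates both the age range and the allowed categories, so `report["is_valid"]` is `False` and the pipeline could stop before transformation.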
Prepares features for model consumption:
- `entity/estimator.py` defines classes like `StandardScalerEstimator` or `InsuranceModelInput` that store fitted transformers.
- `src/components/data_transformation.py` applies transformations such as scaling and encoding, and saves the transformed dataset and transformer object.
- Outputs are recorded in `DataTransformationArtifact`, which includes paths to transformed files and the transformer pickle.
Transformer objects are versioned and later used by the prediction pipeline to ensure consistency between training and serving.
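The fit-once, reuse-at-serving idea can be shown with a deliberately tiny stand-in. The sketch below uses a hand-written standardizer instead of a scikit-learn transformer, and the file name `preprocessing.pkl` and the sample values are assumptions.

```python
import pickle
import statistics
import tempfile
from pathlib import Path


def fit_scaler(values: list[float]) -> dict:
    """Learn mean and standard deviation from training data."""
    scale = statistics.pstdev(values) or 1.0  # guard against zero variance
    return {"mean": statistics.fmean(values), "scale": scale}


def apply_scaler(params: dict, values: list[float]) -> list[float]:
    """Standardize values with previously fitted parameters."""
    return [(v - params["mean"]) / params["scale"] for v in values]


with tempfile.TemporaryDirectory() as tmp:
    transformer_path = Path(tmp) / "preprocessing.pkl"  # hypothetical file name

    # Training: fit on training data only, then persist the fitted state.
    transformer_path.write_bytes(pickle.dumps(fit_scaler([20.0, 30.0, 40.0])))

    # Serving: reload the persisted state instead of refitting on request data.
    params = pickle.loads(transformer_path.read_bytes())
    scaled = apply_scaler(params, [30.0, 50.0])
```

Refitting on serving data would shift the mean and scale, which is exactly the skew the persisted transformer is meant to prevent.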
Trains a machine learning model using scikit‑learn (or your preferred library):
- Parameters (algorithm choice, hyperparameters, validation split) are stored in `config/model.yaml`.
- `src/components/model_trainer.py` contains the `ModelTrainer` class with methods like `train_model()` and `evaluate_model()`.
- Trained models are serialized (`.pkl`) and their metadata stored in `ModelTrainerArtifact`.
Evaluation metrics (accuracy, ROC AUC, etc.) are logged and later compared during the model evaluation stage.
This stage decides whether a newly trained model is worthy of promotion:
- AWS prerequisites
  - Create an IAM user with `AdministratorAccess`.
  - Set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in your environment.
  - Add the credentials and other constants (`MODEL_BUCKET_NAME`, etc.) to `src/constants/__init__.py`.
  - Create an S3 bucket (`my-model-mlopsproj`) and adjust public access settings.
- Evaluation
  - `src/components/model_evaluation.py` compares new and baseline models using a chosen metric and a threshold defined by `MODEL_EVALUATION_CHANGED_THRESHOLD_SCORE`.
  - The result (`ModelEvaluationArtifact`) indicates whether the new model is promoted.
- Pusher
  - If approved, `src/components/model_pusher.py` uses `entity/s3_estimator.py` to upload the model to S3 under the key defined in `MODEL_PUSHER_S3_KEY`.
  - Artifacts include the S3 URI and local archive path.
This separation allows you to implement alternative registries later (e.g., MLflow, DVC) by conforming to the same artifact contracts.
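The promotion decision can be sketched as a single comparison. The threshold value of `0.02`, the artifact fields, and the rule that a first model is accepted when no baseline exists are assumptions; the repository's `ModelEvaluation` component may decide differently.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed value; the real threshold lives in src/constants/__init__.py.
MODEL_EVALUATION_CHANGED_THRESHOLD_SCORE = 0.02


@dataclass(frozen=True)
class ModelEvaluationArtifact:
    # Hypothetical fields; the real artifact may carry more (paths, S3 keys).
    is_model_accepted: bool
    score_change: float


def evaluate(new_score: float, baseline_score: Optional[float]) -> ModelEvaluationArtifact:
    """Accept the new model only if it beats the baseline by the threshold."""
    if baseline_score is None:
        # Assumption: with no model in the registry, the first model is accepted.
        return ModelEvaluationArtifact(True, new_score)
    change = new_score - baseline_score
    return ModelEvaluationArtifact(
        change > MODEL_EVALUATION_CHANGED_THRESHOLD_SCORE, change
    )


promoted = evaluate(new_score=0.85, baseline_score=0.80)
rejected = evaluate(new_score=0.81, baseline_score=0.80)
```

Requiring a margin rather than any improvement avoids replacing the registry model over differences that are within noise.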
Provides real‑time predictions via REST:
- `src/pipline/prediction_pipeline.py` reads the transformer and model, applies them to incoming JSON requests, and returns predictions.
- `app.py` (Flask) defines endpoints:
  - `POST /predict` accepts feature payloads and returns probabilities/labels.
  - `GET /training` triggers the full training pipeline when data has changed (optional; used in demos).
- Basic HTML interface under `templates/vehicledata.html` allows users to input values via browser.
The pipeline ensures the same preprocessing used during training is applied in production, preventing training/serving skew.
Automation keeps the system deployable with every commit:
| Feature | Description |
|---|---|
| Dockerfile | Builds a container housing the Flask app and pipeline. |
| `.github/workflows/aws.yaml` | Workflow to build, test, push image to ECR, and SSH‑deploy to EC2. |
| Self‑hosted runner | An EC2 Ubuntu instance running GitHub Actions jobs, giving full control over environment. |
| GitHub Secrets | Stores AWS creds, ECR_REPO URI, and optionally MONGODB_URL for training in CI. |
Workflow diagram:

    Developer
        │
        │ git push
        ▼
    GitHub
        │
        ▼
    GitHub Actions (CI)
        │
        │ Build Docker Image
        ▼
    Amazon ECR
        │
        ▼
    GitHub Actions (CD)
        │
        ▼
    EC2 Server
        │
        ▼
    Docker Container Running App

- Commit and push code → workflow triggered.
- Image built and pushed to the ECR repository `vehicleprojrepo`.
- EC2 runner pulls the latest image and restarts the service.
- Security group rule opens port 5000 for external access.
⚙️ The self‑hosted runner setup commands are recorded in `project_flow.txt` for reproducibility.
- Ensure the required environment variables are set locally or via GitHub Secrets.
- Execute the training pipeline manually:

      python -c "from src.pipline.training_pipeline import TrainingPipeline; TrainingPipeline().run_pipeline()"

- Start the web server:

      python app.py

- Access the UI at `http://localhost:5000` or `http://<EC2_IP>:5000` in production.
Predictions can also be made programmatically using curl or Postman.
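As an alternative to curl or Postman, a request can be built with Python's standard library. This sketch assumes that `POST /predict` accepts a JSON body and returns JSON, which the text above does not confirm; the feature names are placeholders, and `build_predict_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_predict_request(base_url: str, features: dict) -> urllib.request.Request:
    """Build a JSON POST request for the /predict endpoint."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=json.dumps(features).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical feature names; use the ones defined in config/schema.yaml.
request = build_predict_request(
    "http://localhost:5000", {"Age": 44, "Vehicle_Damage": "Yes"}
)

# Sending requires the Flask server to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

If the endpoint expects form fields instead (as the HTML interface would submit), the payload encoding and header would need to change accordingly.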
- `project_flow.txt` contains chronological development notes; you can follow it to understand design decisions and build the project yourself.
- Notebooks in `notebook/` provide exploratory data analysis and MongoDB interaction examples.
- `artifact/` stores every pipeline run's outputs, including data, models, and validation reports; it serves as an audit trail.
- Logs in `logs/` capture runtime information useful for debugging and monitoring.
- Comprehensive MLOps pipeline from ingest to deployment.
- Modular, testable components align with software engineering best practices.
- Cloud integration with AWS and MongoDB demonstrates real‑world skills.
- CI/CD & automation using Docker and GitHub Actions ensures production readiness.
- Documentation & reproducibility make it easy for reviewers to verify and extend the work.
Feel free to clone the repository and explore—this project is designed to be read, run, and expanded upon by hiring managers and collaborators alike. Happy exploring! 👏