Completed ML Challenge #9

Open · wants to merge 7 commits into `main`
Binary file added .DS_Store
Binary file not shown.
14 changes: 14 additions & 0 deletions .dockerignore
@@ -0,0 +1,14 @@
# Ignore unnecessary files
.git
__pycache__
*.pyc
*.pyo
*.pyd
.vscode
*.egg-info
.DS_Store
.idea

# Ignore virtual environment and environment variables
venv/
.env
10 changes: 9 additions & 1 deletion .gitignore
@@ -1,3 +1,11 @@
Pipfile

.vscode/
.vscode/

# Ignore Python virtual environment folder
venv/

# Ignore models folder
# models/

__pycache__/
28 changes: 28 additions & 0 deletions Dockerfile
@@ -0,0 +1,28 @@
# Use an official Python image as the base
FROM python:3.10-slim

# Set environment variables to avoid Python buffer issues
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

# Set the working directory inside the container
WORKDIR /app

# Add /app and /app/api to PYTHONPATH
ENV PYTHONPATH=/app:/app/api

# Copy only requirements.txt first to leverage Docker layer caching
COPY requirements.txt /app/

# Install production dependencies only
RUN pip install --no-cache-dir -r requirements.txt

# Copy the entire application code
COPY api /app/api
COPY models /models

# Expose port 8000 to access FastAPI
EXPOSE 8000

# Command to run the API using Uvicorn
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
67 changes: 67 additions & 0 deletions HowToDeploy.md
@@ -0,0 +1,67 @@
# Property Price Prediction API

This project provides a FastAPI-based machine learning API for predicting property price categories from the features of a listing. The API classifies each listing into one of four categories: Low, Mid, High, or Luxury.
The application is containerized using Docker to simplify deployment.

## Prerequisites

- **Docker**: Make sure Docker is installed on your machine. You can download it from [https://www.docker.com/get-started](https://www.docker.com/get-started).

## Running the API with Docker

### 1. Clone the Repository

Clone this repository to your local machine:

```bash
git clone https://github.com/polalbacar/the-real-mle-challenge.git
cd the-real-mle-challenge
```

### 2. Build the Docker Image

Build the Docker image using the following command. This will create an image named `ny-estimator-api` (you can change the name if desired):

```bash
docker build -t ny-estimator-api .
```

### 3. Run the Docker Container

Run the container with:

```bash
docker run -d -p 8000:8000 ny-estimator-api
```

- `-d` runs the container in detached mode.
- `-p 8000:8000` maps port 8000 on your machine to port 8000 in the container, where the FastAPI application is running.

### 4. Access the API

Once the container is running, you can access the API:

- **Swagger UI** (Interactive Documentation): [http://localhost:8000/docs](http://localhost:8000/docs)
- **Redoc Documentation**: [http://localhost:8000/redoc](http://localhost:8000/redoc)

These links provide an interactive interface to explore and test the API endpoints.
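
You can also call the endpoint directly from code. Below is a minimal sketch using Python's `requests` library (install it with `pip install requests`); the listing values are purely illustrative and the returned category depends on the trained model:

```python
# Minimal example of calling the /predict endpoint from Python.
# The payload fields mirror the API's input schema; the values are illustrative.
import requests

payload = {
    "id": 1001,
    "accommodates": 4,
    "room_type": "Entire home/apt",
    "beds": 2,
    "bedrooms": 1,
    "bathrooms": 2,
    "neighbourhood": "Brooklyn",
    "tv": 1,
    "elevator": 1,
    "internet": 0,
    "latitude": 40.71383,
    "longitude": -73.9658,
}

response = requests.post("http://localhost:8000/predict", json=payload)
response.raise_for_status()
print(response.json())  # e.g. {"id": 1001, "price_category": "High"}
```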

### Stopping the Container

To stop the Docker container, use:

```bash
docker stop $(docker ps -q --filter ancestor=ny-estimator-api)
```

Or if you know the container ID:

```bash
docker stop <container_id>
```

## TLDR

- **Build** the image with `docker build -t ny-estimator-api .`
- **Run** the container with `docker run -d -p 8000:8000 ny-estimator-api`
- **Access** the API documentation at [http://localhost:8000/docs](http://localhost:8000/docs)
69 changes: 69 additions & 0 deletions SOLUTIONS.md
@@ -0,0 +1,69 @@
# SOLUTIONS.md

This document explains the solutions implemented for the Inteligenz technical assessment, detailing the approach, methodology, and justifications for each of the three challenges.

## Challenge 1 - Refactor DEV Code

### Objective
The objective of this challenge was to take exploratory data science code from Jupyter notebooks and refactor it into a production-ready structure. This required converting it into modular, reusable, and optimized Python scripts that meet production standards.

### Solution

- I split the notebook code into separate scripts within the `scripts` directory, each with a single, clear purpose:
- `data_loader.py`: Handles loading raw and preprocessed data.
- `preprocessing.py`: Contains functions for cleaning and transforming data.
- `model.py`: Contains functions for training, evaluating, saving, and loading the model.
- `utils.py`: Provides utility functions, such as plotting and feature extraction.
- `main.py`: Serves as the main script, running the complete ML pipeline, from data loading and preprocessing to training and evaluation.

- The `main.py` file is the central script and the only file necessary to run the entire pipeline. It:
- Loads raw Airbnb listings data.
- Preprocesses the raw data and saves the cleaned dataset.
- Loads the preprocessed data and splits it for training and testing.
- Trains the model using the processed data.
- Evaluates the model and saves the evaluation results, including feature importance and confusion matrix.
- Saves the trained model in a timestamped folder under `models`, along with its evaluation results. This setup ensures that each trained model is saved with its associated evaluation, allowing for comparisons between models. By keeping a history of models and their performance, we can identify and select the best-performing model. A simplified sketch of this pipeline appears at the end of this section.


- Each script has clear docstrings and comments to ensure that other Machine Learning Engineers (MLEs) can easily understand the code and purpose of each function.
- The code follows the PEP-8 style guide for readability, and functions are logically organized by responsibility. This modular approach makes it easier to test and maintain each part independently.
- I structured the code to facilitate testing, ensuring that each function performs a single, well-defined task. This makes it easy to test each component in isolation.
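
To make the flow orchestrated by `main.py` concrete, here is a simplified, illustrative sketch. The function names are hypothetical stand-ins for the helpers in `data_loader.py`, `preprocessing.py`, and `model.py`; the actual signatures in the repository may differ:

```python
# Illustrative sketch of the main.py pipeline; names and signatures are hypothetical.
from datetime import datetime
from pathlib import Path

from data_loader import load_raw_data, load_preprocessed_data
from preprocessing import preprocess_and_save
from model import train_model, evaluate_model, save_model


def run_pipeline():
    # 1. Load raw Airbnb listings data
    raw_df = load_raw_data()

    # 2. Clean/transform it and persist the cleaned dataset
    preprocess_and_save(raw_df)

    # 3. Reload the preprocessed data and split it for training and testing
    X_train, X_test, y_train, y_test = load_preprocessed_data(split=True)

    # 4. Train and evaluate the classifier
    model = train_model(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)  # e.g. feature importance, confusion matrix

    # 5. Save the model and its evaluation in a timestamped folder under models/
    out_dir = Path("models") / f"model_{datetime.now():%Y-%m-%d_%H-%M-%S}"
    save_model(model, metrics, out_dir)


if __name__ == "__main__":
    run_pipeline()
```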

---

## Challenge 2 - Build an API

### Objective
The goal was to build an API using FastAPI to serve the trained model and classify a property listing based on input features. The API needed to be user-friendly, easy to test, and straightforward to call locally.

### Solution

- I chose **FastAPI** due to its speed, ease of use, and automatic interactive documentation with Swagger UI. FastAPI is also well-suited for serving machine learning models and offers simple request validation with Pydantic.
- The API exposes a `/predict` endpoint that accepts JSON input with features such as `id`, `accommodates`, `room_type`, and `neighbourhood`.
- The API returns a JSON output with the `id` and the predicted `price_category`, which is one of `Low`, `Mid`, `High`, or `Luxury`.
- Input validation is handled with Pydantic, ensuring that required fields are present and correctly formatted.
- The API preprocesses incoming data to match the format expected by the model. This includes encoding categorical variables (e.g., `room_type` and `neighbourhood`), which are mapped to integers as per the training data.
- The API is easy to use locally, and the Swagger UI, accessible at `/docs`, allows for interactive testing.
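
As an illustration of how the endpoint can be exercised locally, the following sketch uses FastAPI's `TestClient` (which requires `httpx`). It assumes the `api` modules are importable when run from the repository root (e.g. with `api/` also on `PYTHONPATH`, as the Dockerfile configures) and that the model file referenced in `api/config.yaml` is available; the listing values are illustrative:

```python
# Illustrative local test; assumes the api package is importable and the model file exists.
from fastapi.testclient import TestClient
from api.main import app

client = TestClient(app)

def test_predict_returns_a_valid_category():
    payload = {
        "id": 1, "accommodates": 2, "room_type": "Private room",
        "beds": 1, "bedrooms": 1, "bathrooms": 1,
        "neighbourhood": "Queens", "tv": 0, "elevator": 0, "internet": 1,
        "latitude": 40.75, "longitude": -73.98,
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    assert response.json()["price_category"] in {"Low", "Mid", "High", "Luxury"}
```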

---

## Challenge 3 - Dockerize Your Solution

### Objective
The objective was to Dockerize the API for ease of deployment and scalability. Docker ensures a consistent environment, making it easier to deploy the application in different environments with the same setup.

### Solution

- I created a **Dockerfile** that defines the environment needed to run the API. It uses the lightweight `python:3.10-slim` base image, which keeps the image small and is sufficient for this application.
- The Dockerfile is structured to use Docker layer caching by copying only `requirements.txt` first, installing dependencies, and then copying the application code.
- The Docker container exposes port 8000, which maps to the FastAPI application’s port.
- A `CMD` command is used to start the FastAPI application with Uvicorn, listening on all interfaces to make it accessible outside the container.

- **Usage instructions**:
  - I included detailed instructions in `HowToDeploy.md` on building the Docker image and running the container.
  - Users can easily build the image with `docker build -t ny-estimator-api .` and run it with `docker run -d -p 8000:8000 ny-estimator-api`.

- **Testing the Dockerized API**:
  - Once running, users can access the API documentation at `http://localhost:8000/docs` and test the endpoint.

---
3 changes: 3 additions & 0 deletions api/config.yaml
@@ -0,0 +1,3 @@
# Configuration file for the API
model:
  path: "../models/model_2024-11-13_18-52-17/model.pkl"
54 changes: 54 additions & 0 deletions api/main.py
@@ -0,0 +1,54 @@
from fastapi import FastAPI, HTTPException
import pickle
import pandas as pd
from utils import preprocess_listing_input
from pydantic_models import ListingInput, ListingOutput
import yaml

# Load configuration from YAML file
def load_config():
    with open("./api/config.yaml", "r") as f:
        return yaml.safe_load(f)

config = load_config()
print("Loaded configuration", config)

# Load the trained model
with open(config['model']['path'], "rb") as f:
    model = pickle.load(f)
print("Loaded model:", model)

# Create FastAPI app instance
app = FastAPI()

# Endpoint to make predictions
@app.post("/predict")
def predict_price_category(listing: ListingInput):
    """
    Predict the price category of a property listing.

    Parameters:
    - listing (ListingInput): Input data containing property details for prediction.

    Returns:
    - ListingOutput: Response with property 'id' and predicted 'price_category'.
    """

    # Preprocess input data for the model
    # model_input_data = preprocess_listing_input(listing.dict())
    model_input_data = preprocess_listing_input(listing.model_dump())

    try:
        # Make the prediction
        category_num = model.predict(model_input_data)[0]  # Get the first prediction

        # Map numeric categories to text labels
        category_map = {0: "Low", 1: "Mid", 2: "High", 3: "Luxury"}
        price_category = category_map.get(category_num, "Unknown")

        # Return the result in the expected format
        return ListingOutput(id=listing.id, price_category=price_category)

    except Exception as e:
        # Raise an HTTP exception if an error occurs during prediction
        raise HTTPException(status_code=500, detail=str(e))
24 changes: 24 additions & 0 deletions api/pydantic_models.py
@@ -0,0 +1,24 @@
from pydantic import BaseModel
from typing import Literal

# Define input model with Pydantic
class ListingInput(BaseModel):
    """Pydantic model to validate input data for a property listing."""
    id: int
    accommodates: int
    room_type: Literal["Shared room", "Private room", "Entire home/apt", "Hotel room"]
    beds: int
    bedrooms: int
    bathrooms: int
    neighbourhood: Literal["Bronx", "Queens", "Staten Island", "Brooklyn", "Manhattan"]
    tv: int
    elevator: int
    internet: int
    latitude: float
    longitude: float

# Define output model with Pydantic
class ListingOutput(BaseModel):
    """Pydantic model to structure the output response for price category prediction."""
    id: int
    price_category: Literal["Low", "Mid", "High", "Luxury"]
27 changes: 27 additions & 0 deletions api/utils.py
@@ -0,0 +1,27 @@
import pandas as pd

# Preprocess the data to match the model's expected features
def preprocess_listing_input(input_listing: dict) -> pd.DataFrame:
    """
    Preprocess input data to match the features expected by the trained model.

    Parameters:
    - input_listing (dict): Dictionary containing the raw input listing data.

    Returns:
    - pd.DataFrame: Preprocessed DataFrame with selected and transformed features.
    """

    # Convert input data to a DataFrame
    df = pd.DataFrame([input_listing])

    # Map categorical features
    MAP_ROOM_TYPE = {"Shared room": 1, "Private room": 2, "Entire home/apt": 3, "Hotel room": 4}
    MAP_NEIGHB = {"Bronx": 1, "Queens": 2, "Staten Island": 3, "Brooklyn": 4, "Manhattan": 5}
    df["neighbourhood"] = df["neighbourhood"].map(MAP_NEIGHB)
    df["room_type"] = df["room_type"].map(MAP_ROOM_TYPE)

    # Define the expected columns in the order required by the model
    FEATURE_NAMES = ['neighbourhood', 'room_type', 'accommodates', 'bathrooms', 'bedrooms']
    df = df[FEATURE_NAMES]
    return df
Binary file added data/.DS_Store
Binary file not shown.