Completed ML Challenge #9

Open · wants to merge 7 commits into `main`
Binary file added .DS_Store
Binary file not shown.
14 changes: 14 additions & 0 deletions .dockerignore
@@ -0,0 +1,14 @@
# Ignore unnecessary files
.git
__pycache__
*.pyc
*.pyo
*.pyd
.vscode
*.egg-info
.DS_Store
.idea

# Ignore virtual environment and environment variables
venv/
.env
10 changes: 9 additions & 1 deletion .gitignore
@@ -1,3 +1,11 @@
Pipfile

.vscode/
.vscode/

# Ignore Python virtual environment folder
venv/

# Ignore models folder
# models/

__pycache__/
28 changes: 28 additions & 0 deletions Dockerfile
@@ -0,0 +1,28 @@
# Use an official Python image as the base
FROM python:3.10-slim

# Set environment variables to avoid Python buffer issues
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1

# Set the working directory inside the container
WORKDIR /app

# Add /app and /app/api to PYTHONPATH
ENV PYTHONPATH=/app:/app/api

# Copy only requirements.txt first to leverage Docker layer caching
COPY requirements.txt /app/

# Install production dependencies only
RUN pip install --no-cache-dir -r requirements.txt

# Copy the entire application code
COPY api /app/api
COPY models /models

# Expose port 8000 to access FastAPI
EXPOSE 8000

# Command to run the API using Uvicorn
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
67 changes: 67 additions & 0 deletions HowToDeploy.md
@@ -0,0 +1,67 @@
# Property Price Prediction API

This project provides a FastAPI-based machine learning API for predicting property price categories from the features of a listing. The API classifies each listing into one of four categories: Low, Mid, High, or Luxury.
The application is containerized using Docker to simplify deployment.

## Prerequisites

- **Docker**: Make sure Docker is installed on your machine. You can download it from [https://www.docker.com/get-started](https://www.docker.com/get-started).

## Running the API with Docker

### 1. Clone the Repository

Clone this repository to your local machine:

```bash
git clone https://github.com/polalbacar/the-real-mle-challenge.git
cd the-real-mle-challenge
```

### 2. Build the Docker Image

Build the Docker image using the following command. This will create an image named `ny-estimator-api` (you can change the name if desired):

```bash
docker build -t ny-estimator-api .
```

### 3. Run the Docker Container

Run the container with:

```bash
docker run -d -p 8000:8000 ny-estimator-api
```

- `-d` runs the container in detached mode.
- `-p 8000:8000` maps port 8000 on your machine to port 8000 in the container, where the FastAPI application is running.

### 4. Access the API

Once the container is running, you can access the API:

- **Swagger UI** (Interactive Documentation): [http://localhost:8000/docs](http://localhost:8000/docs)
- **Redoc Documentation**: [http://localhost:8000/redoc](http://localhost:8000/redoc)

These links provide an interactive interface to explore and test the API endpoints.
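
You can also call the endpoint directly from code. Below is a minimal sketch using Python's `requests` library (install it with `pip install requests`); the listing values are purely illustrative and the returned category depends on the trained model:

```python
# Minimal example of calling the /predict endpoint from Python.
# The payload fields mirror the API's input schema; the values are illustrative.
import requests

payload = {
    "id": 1001,
    "accommodates": 4,
    "room_type": "Entire home/apt",
    "beds": 2,
    "bedrooms": 1,
    "bathrooms": 2,
    "neighbourhood": "Brooklyn",
    "tv": 1,
    "elevator": 1,
    "internet": 0,
    "latitude": 40.71383,
    "longitude": -73.9658,
}

response = requests.post("http://localhost:8000/predict", json=payload)
response.raise_for_status()
print(response.json())  # e.g. {"id": 1001, "price_category": "High"}
```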

### Stopping the Container

To stop the Docker container, use:

```bash
docker stop $(docker ps -q --filter ancestor=ny-estimator-api)
```

Or if you know the container ID:

```bash
docker stop <container_id>
```

## TLDR

- **Build** the image with `docker build -t ny-estimator-api .`
- **Run** the container with `docker run -d -p 8000:8000 ny-estimator-api`
- **Access** the API documentation at [http://localhost:8000/docs](http://localhost:8000/docs)
69 changes: 69 additions & 0 deletions SOLUTIONS.md
@@ -0,0 +1,69 @@
# SOLUTIONS.md

This document explains the solutions implemented for the Inteligenz technical assessment, detailing the approach, methodology, and justifications for each of the three challenges.

## Challenge 1 - Refactor DEV Code

### Objective
The objective of this challenge was to take exploratory data science code from Jupyter notebooks and refactor it into a production-ready structure. This required converting it into modular, reusable, and optimized Python scripts that meet production standards.

### Solution

- I split the notebook code into separate scripts within the `scripts` directory, each with a single, clear purpose:
- `data_loader.py`: Handles loading raw and preprocessed data.
- `preprocessing.py`: Contains functions for cleaning and transforming data.
- `model.py`: Contains functions for training, evaluating, saving, and loading the model.
- `utils.py`: Provides utility functions, such as plotting and feature extraction.
- `main.py`: Serves as the main script, running the complete ML pipeline, from data loading and preprocessing to training and evaluation.

- The `main.py` file is the central script and the only file necessary to run the entire pipeline. It:
- Loads raw Airbnb listings data.
- Preprocesses the raw data and saves the cleaned dataset.
- Loads the preprocessed data and splits it for training and testing.
- Trains the model using the processed data.
- Evaluates the model and saves the evaluation results, including feature importance and confusion matrix.
- Saves the trained model in a timestamped folder under `models`, along with its evaluation results. This setup ensures that each trained model is saved with its associated evaluation, allowing for comparisons between models. By keeping a history of models and their performance, we can identify and select the best-performing model. A simplified sketch of this pipeline appears at the end of this section.


- Each script has clear docstrings and comments to ensure that other Machine Learning Engineers (MLEs) can easily understand the code and purpose of each function.
- The code follows the PEP-8 style guide for readability, and functions are logically organized by responsibility. This modular approach makes it easier to test and maintain each part independently.
- I structured the code to facilitate testing, ensuring that each function performs a single, well-defined task. This makes it easy to test each component in isolation.
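
To make the flow orchestrated by `main.py` concrete, here is a simplified, illustrative sketch. The function names are hypothetical stand-ins for the helpers in `data_loader.py`, `preprocessing.py`, and `model.py`; the actual signatures in the repository may differ:

```python
# Illustrative sketch of the main.py pipeline; names and signatures are hypothetical.
from datetime import datetime
from pathlib import Path

from data_loader import load_raw_data, load_preprocessed_data
from preprocessing import preprocess_and_save
from model import train_model, evaluate_model, save_model


def run_pipeline():
    # 1. Load raw Airbnb listings data
    raw_df = load_raw_data()

    # 2. Clean/transform it and persist the cleaned dataset
    preprocess_and_save(raw_df)

    # 3. Reload the preprocessed data and split it for training and testing
    X_train, X_test, y_train, y_test = load_preprocessed_data(split=True)

    # 4. Train and evaluate the classifier
    model = train_model(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)  # e.g. feature importance, confusion matrix

    # 5. Save the model and its evaluation in a timestamped folder under models/
    out_dir = Path("models") / f"model_{datetime.now():%Y-%m-%d_%H-%M-%S}"
    save_model(model, metrics, out_dir)


if __name__ == "__main__":
    run_pipeline()
```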

---

## Challenge 2 - Build an API

### Objective
The goal was to build an API using FastAPI to serve the trained model and classify a property listing based on input features. The API needed to be user-friendly, easy to test, and straightforward to call locally.

### Solution

- I chose **FastAPI** due to its speed, ease of use, and automatic interactive documentation with Swagger UI. FastAPI is also well-suited for serving machine learning models and offers simple request validation with Pydantic.
- The API exposes a `/predict` endpoint that accepts JSON input with features such as `id`, `accommodates`, `room_type`, and `neighbourhood`.
- The API returns a JSON output with the `id` and the predicted `price_category`, which is one of `Low`, `Mid`, `High`, or `Luxury`.
- Input validation is handled with Pydantic, ensuring that required fields are present and correctly formatted.
- The API preprocesses incoming data to match the format expected by the model. This includes encoding categorical variables (e.g., `room_type` and `neighbourhood`), which are mapped to integers as per the training data.
- The API is easy to use locally, and the Swagger UI, accessible at `/docs`, allows for interactive testing.
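
As an illustration of how the endpoint can be exercised locally, the following sketch uses FastAPI's `TestClient` (which requires `httpx`). It assumes the `api` modules are importable when run from the repository root (e.g. with `api/` also on `PYTHONPATH`, as the Dockerfile configures) and that the model file referenced in `api/config.yaml` is available; the listing values are illustrative:

```python
# Illustrative local test; assumes the api package is importable and the model file exists.
from fastapi.testclient import TestClient
from api.main import app

client = TestClient(app)

def test_predict_returns_a_valid_category():
    payload = {
        "id": 1, "accommodates": 2, "room_type": "Private room",
        "beds": 1, "bedrooms": 1, "bathrooms": 1,
        "neighbourhood": "Queens", "tv": 0, "elevator": 0, "internet": 1,
        "latitude": 40.75, "longitude": -73.98,
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    assert response.json()["price_category"] in {"Low", "Mid", "High", "Luxury"}
```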

---

## Challenge 3 - Dockerize Your Solution

### Objective
The objective was to Dockerize the API for ease of deployment and scalability. Docker ensures a consistent environment, making it easier to deploy the application in different environments with the same setup.

### Solution

- I created a **Dockerfile** that defines the environment needed to run the API. It uses the lightweight `python:3.10-slim` base image, which keeps the image small and is sufficient for this application.
- The Dockerfile is structured to use Docker layer caching by copying only `requirements.txt` first, installing dependencies, and then copying the application code.
- The Docker container exposes port 8000, which maps to the FastAPI application’s port.
- A `CMD` command is used to start the FastAPI application with Uvicorn, listening on all interfaces to make it accessible outside the container.

- **Usage instructions**:
  - I included detailed instructions in `HowToDeploy.md` on building the Docker image and running the container.
  - Users can easily build the image with `docker build -t ny-estimator-api .` and run it with `docker run -d -p 8000:8000 ny-estimator-api`.

- **Testing the Dockerized API**:
  - Once running, users can access the API documentation at `http://localhost:8000/docs` and test the endpoint.

---
3 changes: 3 additions & 0 deletions api/config.yaml
@@ -0,0 +1,3 @@
# Configuration file for the API
model:
  path: "../models/model_2024-11-13_18-52-17/model.pkl"
54 changes: 54 additions & 0 deletions api/main.py
@@ -0,0 +1,54 @@
from fastapi import FastAPI, HTTPException
import pickle
import pandas as pd
from utils import preprocess_listing_input
from pydantic_models import ListingInput, ListingOutput
import yaml

# Load configuration from YAML file
def load_config():
    with open("./api/config.yaml", "r") as f:
        return yaml.safe_load(f)

config = load_config()
print("Loaded configuration", config)

# Load the trained model
with open(config['model']['path'], "rb") as f:
    model = pickle.load(f)
print("Loaded model:", model)

# Create FastAPI app instance
app = FastAPI()

# Endpoint to make predictions
@app.post("/predict")
def predict_price_category(listing: ListingInput):
    """
    Predict the price category of a property listing.

    Parameters:
    - listing (ListingInput): Input data containing property details for prediction.

    Returns:
    - ListingOutput: Response with property 'id' and predicted 'price_category'.
    """

    # Preprocess input data for the model
    # model_input_data = preprocess_listing_input(listing.dict())
    model_input_data = preprocess_listing_input(listing.model_dump())

    try:
        # Make the prediction
        category_num = model.predict(model_input_data)[0]  # Get the first prediction

        # Map numeric categories to text labels
        category_map = {0: "Low", 1: "Mid", 2: "High", 3: "Luxury"}
        price_category = category_map.get(category_num, "Unknown")

        # Return the result in the expected format
        return ListingOutput(id=listing.id, price_category=price_category)

    except Exception as e:
        # Raise an HTTP exception if an error occurs during prediction
        raise HTTPException(status_code=500, detail=str(e))
24 changes: 24 additions & 0 deletions api/pydantic_models.py
@@ -0,0 +1,24 @@
from pydantic import BaseModel
from typing import Literal

# Define input model with Pydantic
class ListingInput(BaseModel):
    """Pydantic model to validate input data for a property listing."""
    id: int
    accommodates: int
    room_type: Literal["Shared room", "Private room", "Entire home/apt", "Hotel room"]
    beds: int
    bedrooms: int
    bathrooms: int
    neighbourhood: Literal["Bronx", "Queens", "Staten Island", "Brooklyn", "Manhattan"]
    tv: int
    elevator: int
    internet: int
    latitude: float
    longitude: float

# Define output model with Pydantic
class ListingOutput(BaseModel):
    """Pydantic model to structure the output response for price category prediction."""
    id: int
    price_category: Literal["Low", "Mid", "High", "Luxury"]
27 changes: 27 additions & 0 deletions api/utils.py
@@ -0,0 +1,27 @@
import pandas as pd

# Preprocess the data to match the model's expected features
def preprocess_listing_input(input_listing: dict) -> pd.DataFrame:
    """
    Preprocess input data to match the features expected by the trained model.

    Parameters:
    - input_listing (dict): Dictionary containing the raw input listing data.

    Returns:
    - pd.DataFrame: Preprocessed DataFrame with selected and transformed features.
    """

    # Convert input data to a DataFrame
    df = pd.DataFrame([input_listing])

    # Map categorical features
    MAP_ROOM_TYPE = {"Shared room": 1, "Private room": 2, "Entire home/apt": 3, "Hotel room": 4}
    MAP_NEIGHB = {"Bronx": 1, "Queens": 2, "Staten Island": 3, "Brooklyn": 4, "Manhattan": 5}
    df["neighbourhood"] = df["neighbourhood"].map(MAP_NEIGHB)
    df["room_type"] = df["room_type"].map(MAP_ROOM_TYPE)

    # Define the expected columns in the order required by the model
    FEATURE_NAMES = ['neighbourhood', 'room_type', 'accommodates', 'bathrooms', 'bedrooms']
    df = df[FEATURE_NAMES]
    return df
Binary file added data/.DS_Store
Binary file not shown.