feat: my solution in one big commit #4

Open · wants to merge 1 commit into base: main
36 changes: 36 additions & 0 deletions Dockerfile
@@ -0,0 +1,36 @@
# Use the jupyter/pyspark-notebook image as a base image
FROM jupyter/pyspark-notebook:latest

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH=$JAVA_HOME/bin:$PATH
ENV PYTHONPATH=/app

# Install Java (required for Spark)
USER root
RUN apt-get update && \
    apt-get install -y openjdk-11-jdk && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set the working directory
WORKDIR /app

# Install additional Python dependencies first for better layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . /app

# Expose the port for the FastAPI application
EXPOSE 8000

# Set the entrypoint
# Ensure the entrypoint script is executable
RUN chmod +x /app/entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]

# Default command
CMD ["api"]
116 changes: 116 additions & 0 deletions SOLUTIONS.md
@@ -0,0 +1,116 @@
# MLE Challenge Solution

## Overview

This challenge involves using Airbnb listing data to predict the price category for new listings. The task is divided into three main challenges:

1. Refactor the development code for production.
2. Build an API to use the trained model for predictions.
3. Dockerize the solution for easy deployment and scalability.

I worked on solving this challenge on September 22nd, from 2:00 PM to 8:00 PM.

## Solution

I created five modules: data_preprocessing, training, inference, configuration, and tests.
Apart from those, I maintain the models and data folders.

To build the Docker image:

```bash
docker build -t the_real_mle_challenge .
```

### Data Preprocessing

- Objective: Transform the exploratory Jupyter notebook code into a production script.
- Changes:
  - Migrated from pandas to PySpark so preprocessing can scale beyond a single machine.
  - Separated transformations into auxiliary functions for better readability (see the sketch below).
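
For illustration, here is a minimal sketch of that auxiliary-function style in PySpark. The column names (`bathrooms_text`, `amenities`) are assumptions about the raw listings schema, not the module's exact code:

```python
# Hedged sketch of the auxiliary-function style in data_preprocessing;
# the column names (bathrooms_text, amenities) are assumptions.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def parse_bathrooms(df: DataFrame) -> DataFrame:
    # Extract the leading number from strings such as "1.5 baths"
    return df.withColumn(
        "bathrooms",
        F.regexp_extract(F.col("bathrooms_text"), r"(\d+\.?\d*)", 1).cast("double"),
    )

def flag_amenity(df: DataFrame, amenity: str) -> DataFrame:
    # Add a binary column: 1 if the amenity appears in the raw amenities string
    return df.withColumn(
        amenity.lower(),
        F.when(F.lower(F.col("amenities")).contains(amenity.lower()), 1).otherwise(0),
    )
```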

### Training

- Objective: Implement the logic to build the classifier.
- Changes:
  - Called the data_preprocessing module.
  - Performed additional data preprocessing steps specific to the model.
  - Trained the model, computed and logged metrics, and serialized it with Pickle (see the sketch after the run command below).
  - Refactored with the future goal of automating model retraining.
- To run the training module:

```bash
docker run -it --rm the_real_mle_challenge train
```
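
A hedged sketch of the training flow described above; the classifier choice and split parameters are assumptions, not necessarily what the training module uses (the `category` target column is taken from the test output further down):

```python
# Hedged sketch of training: fit, log a metric, serialize with Pickle.
# RandomForestClassifier and the 80/20 split are assumptions.
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from configuration.config import FILEPATH_MODEL
from logger import logger

FEATURES = ["neighbourhood", "room_type", "accommodates", "bathrooms", "bedrooms",
            "beds", "tv", "elevator", "internet", "latitude", "longitude"]

def train(df: pd.DataFrame) -> None:
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df["category"], test_size=0.2, random_state=42
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)

    # Compute and log a simple metric before serializing
    logger.info(f"Test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
    with open(FILEPATH_MODEL, "wb") as f:
        pickle.dump(clf, f)
```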

### Inference

- Objective: Load the serialized model and perform inference on API requests.
- Implemented the API with FastAPI, which makes it easy to use and test (interactive docs, request validation).
- The API receives input data, processes it, and returns the predicted price category (a sketch of the inference entry point follows the run command below).
- To run the inference module:

```bash
docker run -it --rm -p 8000:8000 the_real_mle_challenge
```
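
For reference, a minimal sketch of what the `predict_price_category` entry point could look like. The `MAP_CATEGORY` label mapping is an assumption (only "High" appears in the example response), and the real inference module may differ:

```python
# Hedged sketch of inference: load the pickled model once, predict per request.
# MAP_CATEGORY is an assumed label mapping, not confirmed by the repository.
import pickle

import pandas as pd

from configuration.config import FILEPATH_MODEL

with open(FILEPATH_MODEL, "rb") as f:
    model = pickle.load(f)

MAP_CATEGORY = {0: "Low", 1: "Mid", 2: "High", 3: "Lux"}  # assumption

def predict_price_category(data: dict) -> str:
    # `data` is the one-row, column-oriented dict built by the API layer
    prediction = model.predict(pd.DataFrame(data))[0]
    return MAP_CATEGORY[int(prediction)]
```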

You can check the API documentation here: [http://localhost:8000/docs](http://localhost:8000/docs).

To make a request:

```bash
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{
"id": 1001,
"accommodates": 4,
"room_type": "Entire home/apt",
"beds": 2,
"bedrooms": 1,
"bathrooms": 2,
"neighbourhood": "Brooklyn",
"tv": 1,
"elevator": 1,
"internet": 0,
"latitude": 40.71383,
"longitude": -73.9658
}'
```

The API responds with:

```json
{"id":1001,"price_category":"High"}
```

### Tests

- Objective: Implement automated tests to ensure the code is reliable and works as expected.
- Solution:
  - Implemented tests for the data_preprocessing module.
  - Used static data for the input and the expected output.
  - Tested the loading function and the preprocessing output (a sketch of the test follows the run command below).
- To run the tests:

```bash
docker run -it --rm the_real_mle_challenge test
```
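
A hedged sketch of how such a test could look; `preprocess` is a hypothetical entry point standing in for the real data_preprocessing function:

```python
# Hedged sketch of the preprocessing test; `preprocess` is hypothetical.
import unittest

import pandas as pd

from configuration.config import FILEPATH_TEST_INPUT_DATA, FILEPATH_TEST_EXPECTED_DATA

class TestDataPreprocessing(unittest.TestCase):
    def test_preprocessing_matches_expected_output(self):
        expected = pd.read_csv(FILEPATH_TEST_EXPECTED_DATA)
        actual = preprocess(FILEPATH_TEST_INPUT_DATA)  # hypothetical: returns a pandas DataFrame
        # assert_frame_equal produces diffs like the one shown in Test Results
        pd.testing.assert_frame_equal(actual, expected, check_dtype=False)
```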

### Test Results

- First test: passed.
- Second test: failed with the following error:

```
DataFrame.iloc[:, 12] (column name="category") values are different (0.05025 %)
At positional index 8695, first diff: 0 != nan

----------------------------------------------------------------------
Ran 2 tests in 16.557s
```

This mismatch most likely comes from a null-handling difference introduced in the pandas-to-PySpark migration (note the `0 != nan` diff). Further debugging is required, but the tests demonstrate how they would run in production.

## Further Enhancements

- Split the modules into separate Docker containers.
- Build an ETL pipeline to feed upstream data and populate a relational database instead of writing .csv files or Spark partitions on disk.
- Implement batch and real-time processing (with Kafka) according to latency needs.
76 changes: 76 additions & 0 deletions api.py
@@ -0,0 +1,76 @@
from pathlib import Path
from typing import Dict

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from inference.main import predict_price_category
from logger import logger

# Log the current directory
current_directory = Path.cwd()
logger.info(f"Current directory: {current_directory}")


# Define the input data model
class ListingInput(BaseModel):
    id: int
    accommodates: int
    room_type: str
    beds: int
    bedrooms: int
    bathrooms: int
    neighbourhood: str
    tv: int
    elevator: int
    internet: int
    latitude: float
    longitude: float


# Define the output data model
class ListingOutput(BaseModel):
    id: int
    price_category: str

# Create the FastAPI application
app = FastAPI()

MAP_ROOM_TYPE = {"Shared room": 1, "Private room": 2, "Entire home/apt": 3, "Hotel room": 4}
MAP_NEIGHB = {"Bronx": 1, "Queens": 2, "Staten Island": 3, "Brooklyn": 4, "Manhattan": 5}

# Define the prediction endpoint
@app.post("/predict", response_model=ListingOutput)
def do_prediction(listing: ListingInput):
    """
    Predict the price category for a given listing.

    Args:
        listing (ListingInput): The details of the listing to classify.

    Returns:
        ListingOutput: The listing ID and the predicted price category.

    Raises:
        HTTPException: 422 if the neighbourhood or room type is not present in
            the respective mapping dictionaries.
    """
    if listing.neighbourhood not in MAP_NEIGHB:
        raise HTTPException(status_code=422, detail=f"Unknown neighbourhood: {listing.neighbourhood}")
    if listing.room_type not in MAP_ROOM_TYPE:
        raise HTTPException(status_code=422, detail=f"Unknown room type: {listing.room_type}")

    # Convert input data to a one-row, column-oriented dict
    data = {
        "neighbourhood": [MAP_NEIGHB[listing.neighbourhood]],
        "room_type": [MAP_ROOM_TYPE[listing.room_type]],
        "accommodates": [listing.accommodates],
        "bathrooms": [listing.bathrooms],
        "bedrooms": [listing.bedrooms],
        "beds": [listing.beds],
        "tv": [listing.tv],
        "elevator": [listing.elevator],
        "internet": [listing.internet],
        "latitude": [listing.latitude],
        "longitude": [listing.longitude],
    }

    # Make prediction
    price_category = predict_price_category(data)

    # Return the result
    return ListingOutput(id=listing.id, price_category=price_category)
Empty file added configuration/__init__.py
Empty file.
18 changes: 18 additions & 0 deletions configuration/config.py
@@ -0,0 +1,18 @@
from pathlib import Path

from logger import logger

# Define directories
DIR_REPO = Path.cwd()
logger.info(f"Repository directory: {DIR_REPO}")
DIR_DATA_PROCESSED = Path(DIR_REPO) / "data" / "processed"
DIR_DATA_RAW = Path(DIR_REPO) / "data" / "raw"
DIR_MODELS = Path(DIR_REPO) / "models"
TESTS_DIR = Path(DIR_REPO) / "tests"

# Define file paths
FILEPATH_PROCESSED = DIR_DATA_PROCESSED / "preprocessed_listings_guillermo"
FILEPATH_MODEL = DIR_MODELS / "simple_classifier_guillermo.pkl"
FILEPATH_DATA = DIR_DATA_RAW / "listings.csv"
FILEPATH_TEST_INPUT_DATA = TESTS_DIR / "listings.csv"
FILEPATH_TEST_EXPECTED_DATA = TESTS_DIR / "expected_preprocessed_listings.csv"