feat: my solution in one big commit #4

Open · wants to merge 1 commit into base: main
36 changes: 36 additions & 0 deletions Dockerfile
@@ -0,0 +1,36 @@
# Use the jupyter/pyspark-notebook image as a base image
FROM jupyter/pyspark-notebook:latest

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH=$JAVA_HOME/bin:$PATH
ENV PYTHONPATH=/app

# Install Java (required for Spark)
USER root
RUN apt-get update && \
    apt-get install -y openjdk-11-jdk && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set the working directory
WORKDIR /app

# Install additional Python dependencies first for better layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . /app

# Expose the port for the FastAPI application
EXPOSE 8000

# Set the entrypoint
# Ensure the entrypoint script is executable
RUN chmod +x /app/entrypoint.sh
ENTRYPOINT ["/app/entrypoint.sh"]

# Default command
CMD ["api"]
116 changes: 116 additions & 0 deletions SOLUTIONS.md
@@ -0,0 +1,116 @@
# MLE Challenge Solution

## Overview

This challenge involves using Airbnb listing data to predict the price category for new listings. The task is divided into three main challenges:

1. Refactor the development code for production.
2. Build an API to use the trained model for predictions.
3. Dockerize the solution for easy deployment and scalability.

I worked on solving this challenge on September 22nd, from 2:00 PM to 8:00 PM.

## Solution

I created five modules: data_preprocessing, training, inference, configuration, and tests.
Apart from those, I maintain the models and data folders.

To build the Docker image:

```bash
docker build -t the_real_mle_challenge .
```

### Data Preprocessing

- Objective: Transform the exploratory Jupyter notebook code into a production script.
- Changes:
  - Migrated from pandas to PySpark so preprocessing can scale beyond a single machine.
  - Separated transformations into auxiliary functions for better readability (see the sketch below).
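
For illustration, here is a minimal sketch of that auxiliary-function style in PySpark. The column names (`bathrooms_text`, `amenities`) are assumptions about the raw listings schema, not the module's exact code:

```python
# Hedged sketch of the auxiliary-function style in data_preprocessing;
# the column names (bathrooms_text, amenities) are assumptions.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def parse_bathrooms(df: DataFrame) -> DataFrame:
    # Extract the leading number from strings such as "1.5 baths"
    return df.withColumn(
        "bathrooms",
        F.regexp_extract(F.col("bathrooms_text"), r"(\d+\.?\d*)", 1).cast("double"),
    )

def flag_amenity(df: DataFrame, amenity: str) -> DataFrame:
    # Add a binary column: 1 if the amenity appears in the raw amenities string
    return df.withColumn(
        amenity.lower(),
        F.when(F.lower(F.col("amenities")).contains(amenity.lower()), 1).otherwise(0),
    )
```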

### Training

- Objective: Implement the logic to build the classifier.
- Changes:
  - Called the data_preprocessing module.
  - Performed additional data preprocessing steps specific to the model.
  - Trained the model, computed and logged metrics, and serialized it with Pickle (see the sketch after the run command below).
  - Refactored with the future goal of automating model retraining.
- To run the training module:

```bash
docker run -it --rm the_real_mle_challenge train
```
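
A hedged sketch of the training flow described above; the classifier choice and split parameters are assumptions, not necessarily what the training module uses (the `category` target column is taken from the test output further down):

```python
# Hedged sketch of training: fit, log a metric, serialize with Pickle.
# RandomForestClassifier and the 80/20 split are assumptions.
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from configuration.config import FILEPATH_MODEL
from logger import logger

FEATURES = ["neighbourhood", "room_type", "accommodates", "bathrooms", "bedrooms",
            "beds", "tv", "elevator", "internet", "latitude", "longitude"]

def train(df: pd.DataFrame) -> None:
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df["category"], test_size=0.2, random_state=42
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)

    # Compute and log a simple metric before serializing
    logger.info(f"Test accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")
    with open(FILEPATH_MODEL, "wb") as f:
        pickle.dump(clf, f)
```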

### Inference

- Objective: Load the serialized model and perform inference on API requests.
- Implemented the API with FastAPI, which makes it easy to use and test (interactive docs, request validation).
- The API receives input data, processes it, and returns the predicted price category (a sketch of the inference entry point follows the run command below).
- To run the inference module:

```bash
docker run -it --rm -p 8000:8000 the_real_mle_challenge
```
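
For reference, a minimal sketch of what the `predict_price_category` entry point could look like. The `MAP_CATEGORY` label mapping is an assumption (only "High" appears in the example response), and the real inference module may differ:

```python
# Hedged sketch of inference: load the pickled model once, predict per request.
# MAP_CATEGORY is an assumed label mapping, not confirmed by the repository.
import pickle

import pandas as pd

from configuration.config import FILEPATH_MODEL

with open(FILEPATH_MODEL, "rb") as f:
    model = pickle.load(f)

MAP_CATEGORY = {0: "Low", 1: "Mid", 2: "High", 3: "Lux"}  # assumption

def predict_price_category(data: dict) -> str:
    # `data` is the one-row, column-oriented dict built by the API layer
    prediction = model.predict(pd.DataFrame(data))[0]
    return MAP_CATEGORY[int(prediction)]
```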

You can check the API documentation here: [http://localhost:8000/docs](http://localhost:8000/docs).

To make a request:

```bash
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{
"id": 1001,
"accommodates": 4,
"room_type": "Entire home/apt",
"beds": 2,
"bedrooms": 1,
"bathrooms": 2,
"neighbourhood": "Brooklyn",
"tv": 1,
"elevator": 1,
"internet": 0,
"latitude": 40.71383,
"longitude": -73.9658
}'
```

The API responds with:

```json
{"id":1001,"price_category":"High"}
```

### Tests

- Objective: Implement automated tests to ensure the code is reliable and works as expected.
- Solution:
  - Implemented tests for the data_preprocessing module.
  - Used static data for the input and the expected output.
  - Tested the loading function and the preprocessing output (a sketch of the test follows the run command below).
- To run the tests:

```bash
docker run -it --rm the_real_mle_challenge test
```
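
A hedged sketch of how such a test could look; `preprocess` is a hypothetical entry point standing in for the real data_preprocessing function:

```python
# Hedged sketch of the preprocessing test; `preprocess` is hypothetical.
import unittest

import pandas as pd

from configuration.config import FILEPATH_TEST_INPUT_DATA, FILEPATH_TEST_EXPECTED_DATA

class TestDataPreprocessing(unittest.TestCase):
    def test_preprocessing_matches_expected_output(self):
        expected = pd.read_csv(FILEPATH_TEST_EXPECTED_DATA)
        actual = preprocess(FILEPATH_TEST_INPUT_DATA)  # hypothetical: returns a pandas DataFrame
        # assert_frame_equal produces diffs like the one shown in Test Results
        pd.testing.assert_frame_equal(actual, expected, check_dtype=False)
```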

### Test Results

- First test: passed.
- Second test: failed with the following error:

```
DataFrame.iloc[:, 12] (column name="category") values are different (0.05025 %)
At positional index 8695, first diff: 0 != nan

----------------------------------------------------------------------
Ran 2 tests in 16.557s
```

This mismatch most likely comes from a null-handling difference introduced in the pandas-to-PySpark migration (note the `0 != nan` diff). Further debugging is required, but the tests demonstrate how they would run in production.

## Further Enhancements

- Split the modules into separate Docker containers.
- Build an ETL pipeline to feed upstream data and populate a relational database instead of writing .csv files or Spark partitions on disk.
- Implement batch and real-time processing (with Kafka) according to latency needs.
76 changes: 76 additions & 0 deletions api.py
@@ -0,0 +1,76 @@
from pathlib import Path
from typing import Dict

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from inference.main import predict_price_category
from logger import logger

# Log the current directory
current_directory = Path.cwd()
logger.info(f"Current directory: {current_directory}")


# Define the input data model
class ListingInput(BaseModel):
    id: int
    accommodates: int
    room_type: str
    beds: int
    bedrooms: int
    bathrooms: int
    neighbourhood: str
    tv: int
    elevator: int
    internet: int
    latitude: float
    longitude: float


# Define the output data model
class ListingOutput(BaseModel):
    id: int
    price_category: str

# Create the FastAPI application
app = FastAPI()

MAP_ROOM_TYPE = {"Shared room": 1, "Private room": 2, "Entire home/apt": 3, "Hotel room": 4}
MAP_NEIGHB = {"Bronx": 1, "Queens": 2, "Staten Island": 3, "Brooklyn": 4, "Manhattan": 5}

# Define the prediction endpoint
@app.post("/predict", response_model=ListingOutput)
def do_prediction(listing: ListingInput):
    """
    Predict the price category for a given listing.

    Args:
        listing (ListingInput): The details of the listing to classify.

    Returns:
        ListingOutput: The listing ID and the predicted price category.

    Raises:
        HTTPException: 422 if the neighbourhood or room type is not present in
            the respective mapping dictionaries.
    """
    if listing.neighbourhood not in MAP_NEIGHB:
        raise HTTPException(status_code=422, detail=f"Unknown neighbourhood: {listing.neighbourhood}")
    if listing.room_type not in MAP_ROOM_TYPE:
        raise HTTPException(status_code=422, detail=f"Unknown room type: {listing.room_type}")

    # Convert input data to a one-row, column-oriented dict
    data = {
        "neighbourhood": [MAP_NEIGHB[listing.neighbourhood]],
        "room_type": [MAP_ROOM_TYPE[listing.room_type]],
        "accommodates": [listing.accommodates],
        "bathrooms": [listing.bathrooms],
        "bedrooms": [listing.bedrooms],
        "beds": [listing.beds],
        "tv": [listing.tv],
        "elevator": [listing.elevator],
        "internet": [listing.internet],
        "latitude": [listing.latitude],
        "longitude": [listing.longitude],
    }

    # Make prediction
    price_category = predict_price_category(data)

    # Return the result
    return ListingOutput(id=listing.id, price_category=price_category)
Empty file added configuration/__init__.py
Empty file.
18 changes: 18 additions & 0 deletions configuration/config.py
@@ -0,0 +1,18 @@
from pathlib import Path

from logger import logger

# Define directories
DIR_REPO = Path.cwd()
logger.info(f"Repository directory: {DIR_REPO}")
DIR_DATA_PROCESSED = Path(DIR_REPO) / "data" / "processed"
DIR_DATA_RAW = Path(DIR_REPO) / "data" / "raw"
DIR_MODELS = Path(DIR_REPO) / "models"
TESTS_DIR = Path(DIR_REPO) / "tests"

# Define file paths
FILEPATH_PROCESSED = DIR_DATA_PROCESSED / "preprocessed_listings_guillermo"
FILEPATH_MODEL = DIR_MODELS / "simple_classifier_guillermo.pkl"
FILEPATH_DATA = DIR_DATA_RAW / "listings.csv"
FILEPATH_TEST_INPUT_DATA = TESTS_DIR / "listings.csv"
FILEPATH_TEST_EXPECTED_DATA = TESTS_DIR / "expected_preprocessed_listings.csv"