This project processes and analyzes recipe data using Apache Spark. The application consists of two main tasks: preprocessing the raw data and performing analysis to extract insights about recipes containing beef and their cooking durations.
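The analysis described above can be sketched in plain Python. This is a minimal sketch, not the project's actual implementation: the field names (`ingredients`, `cookTime`, `prepTime`), the ISO 8601 duration format (e.g. `PT1H30M`), and the difficulty thresholds are assumptions about the input data, and the real job operates on Spark DataFrames rather than Python lists.

```python
import re

# NOTE: field names and the ISO 8601 duration format are assumptions;
# the actual schema is defined by the input JSON files.
def parse_iso_duration(s):
    """Convert an ISO 8601 duration like 'PT1H30M' to total minutes."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", s or "")
    if not m:
        return 0
    hours = int(m.group(1) or 0)
    minutes = int(m.group(2) or 0)
    return hours * 60 + minutes

def difficulty(total_minutes):
    """Bucket total cooking time into a label (thresholds are illustrative)."""
    if total_minutes > 60:
        return "hard"
    if total_minutes >= 30:
        return "medium"
    return "easy"

# Tiny in-memory stand-in for the JSON recipe records.
recipes = [
    {"ingredients": "beef, onion", "cookTime": "PT1H", "prepTime": "PT15M"},
    {"ingredients": "tofu, rice", "cookTime": "PT20M", "prepTime": "PT5M"},
]

# Keep only recipes that mention beef, then label each by total time.
beef = [r for r in recipes if "beef" in r["ingredients"].lower()]
labels = [
    difficulty(parse_iso_duration(r["cookTime"]) + parse_iso_duration(r["prepTime"]))
    for r in beef
]
```

In the actual Spark job the same logic would typically be expressed as a DataFrame filter plus a UDF or column expression, so it can scale beyond in-memory lists.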
- Python 3.8+
- Apache Spark
Note: A Java Development Kit (JDK) is required to work with Apache Spark.
- Clone the repository:

      git clone https://github.com/AnilkumarBorra/spark.git
      cd recipe_analysis
- Create a virtual environment and install dependencies:

      python3 -m venv spark-env
      source spark-env/bin/activate
      pip install -r requirements.txt
- Run the application locally:

      python src/main.py
- Run the application using Docker:

      docker build -t recipe-analysis .
      docker run recipe-analysis
- Ensure the input data (`recipes-000.json`, `recipes-001.json`, `recipes-002.json`) is located in the `input` folder.
- The processed data and analysis results will be written to the `output` folder.
Unit tests are located in the `tests` directory. To run the tests:

    pytest tests
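As an illustration of the pytest style used in that directory, a test module might look like the following. This is a hypothetical sketch: the module name, the `difficulty` helper, and its thresholds are assumptions for illustration, not names taken from this repository.

```python
# tests/test_difficulty.py -- illustrative sketch only; the real test
# modules and helper names in this repository may differ.

def difficulty(total_minutes):
    """Hypothetical bucketing of total cook+prep time (assumed thresholds)."""
    if total_minutes > 60:
        return "hard"
    if total_minutes >= 30:
        return "medium"
    return "easy"

def test_buckets():
    # pytest discovers functions named test_* and runs their assertions.
    assert difficulty(90) == "hard"
    assert difficulty(45) == "medium"
    assert difficulty(10) == "easy"
```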