This is an academic project which aims to build a data streaming pipeline using Spark Structured Streaming, Elasticsearch and Kibana. The project takes real-world production logs from NASA and processes them with Spark Structured Streaming. The output is stored in Elasticsearch and visualized in a Kibana dashboard.
Resources:
- Data: NASA-HTTP
References:
This project is also inspired by several similar projects:
- Scalable Web Server Log Analytics with Apache Spark
- Web-Server-Log-Analysis-with-PySpark
- Log Analysis 📈 with Spark Streaming 📺, ElasticSearch 📊 and Kibana 👀 :
Requirements:
- Python 3.8
- docker, docker-compose
# Virtual environment
python3 -m venv ./venv
source ./venv/bin/activate
# Install libraries
pip install -r requirements.txt
Create a .env file following the format in .env.sample:
cp .env.sample .env
Download the NASA log data:
chmod +x *.sh  # make the scripts executable
./prepare.sh   # download the logs
Elasticsearch and Kibana:
docker-compose -f elasticsearch-kibana-compose.yaml up
Note: the Elasticsearch node runs on port 9200 (ES_PORT in .env, https://localhost:9200) and Kibana on port 5601 (KIBANA_PORT in .env, https://localhost:5601).
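The exact contents of .env.sample are not reproduced here; based on the two variables mentioned above, the .env file presumably looks something like the following sketch (any other variables in the real .env.sample are not covered):

```
ES_PORT=9200
KIBANA_PORT=5601
```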
Simulated data server: run Netcat as a data server
nc -lk 9999
Spark:
./run.sh
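The heart of the Spark job is parsing NASA Common Log Format lines into structured fields. As an illustration (not the project's actual code), the parsing step can be sketched in plain Python with a regular expression; the field names here are our own choice:

```python
import re

# Common Log Format: host ident user [timestamp] "method endpoint protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)$'
)

def parse_log_line(line):
    """Parse one NASA access-log line into a dict, or return None if malformed."""
    m = LOG_PATTERN.match(line.strip())
    if not m:
        return None
    host, _ident, _user, ts, method, endpoint, protocol, status, size = m.groups()
    return {
        "host": host,
        "timestamp": ts,
        "method": method,
        "endpoint": endpoint,
        "protocol": protocol,
        "status": int(status),
        # the size field is "-" when no body was returned
        "content_size": 0 if size == "-" else int(size),
    }
```

In the Spark job itself, the same regular expression can be applied to the streaming DataFrame (e.g. with regexp_extract) rather than in plain Python.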
Push some logs to the data server and watch the results appear in Kibana.
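Logs can be pasted into the Netcat terminal by hand, but the replay can also be scripted. A minimal sketch (our own helper, not part of the repository) that streams a downloaded log file to the Netcat server line by line:

```python
import socket
import time

def send_logs(path, host="localhost", port=9999, limit=None, delay=0.0):
    """Stream log lines from `path` to a TCP server such as `nc -lk 9999`."""
    with socket.create_connection((host, port)) as sock, \
         open(path, errors="replace") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            sock.sendall(line.encode("utf-8"))
            if delay:
                time.sleep(delay)  # optional throttle to simulate a live stream
```

For example, send_logs("data/NASA_access_log_Aug95", delay=0.01) replays the August logs slowly enough for the streaming query to keep up (assuming prepare.sh placed the file under data/).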
This section describes our attempt to integrate Kafka (run via Docker) into the project. Some problems occurred, so we still cannot run Kafka properly.
- To run Kafka in Docker:
docker-compose -f kafka-compose.yaml up -d
- To run the simple producer:
python3 simple-server/__init__.py data/NASA_access_log_Aug95
- To run the main Spark job:
./run_with_kafka.sh
Almost all of these steps run correctly, but we seem to have trouble with the withColumns function (note that DataFrame.withColumns was only added in PySpark 3.3; older versions must chain withColumn calls instead), and the Elasticsearch sink appears to be incompatible with the Kafka source.