sLeeNguyen/spark-logs-analysis

Spark Web Logs Analysis

This is an academic project that aims to build a data streaming pipeline using Spark Structured Streaming, Elasticsearch, and Kibana. The project takes real-world production logs from NASA and processes them with Spark Structured Streaming. The output is stored in Elasticsearch and visualized in a Kibana dashboard.
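To illustrate the kind of processing the pipeline applies to each record, here is a minimal sketch that parses one access-log line in plain Python. It assumes the NASA logs follow the Common Log Format (host, timestamp, request, status, size). The regular expression, the function name `parse_log_line`, and the sample line are illustrative assumptions, not the project's actual Spark code.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed layout (Common Log Format):
# host ident user [timestamp] "request" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()

    # Split the request into method, path and protocol where present.
    parts = fields.pop("request").split()
    fields["method"] = parts[0] if len(parts) > 0 else None
    fields["path"] = parts[1] if len(parts) > 1 else None
    fields["protocol"] = parts[2] if len(parts) > 2 else None

    # Convert the typed fields; a size of "-" means no body was sent.
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    fields["status"] = int(fields["status"])
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Illustrative line (made up for this example, not taken from the dataset).
sample = 'example.host - - [01/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1839'
record = parse_log_line(sample)
```

Lines that do not match the assumed layout return None, so a caller can count or drop them instead of failing.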

Resources:

References:

This project is also inspired by several similar projects:

Required:

  • Python 3.8
  • docker, docker-compose

Set up

# Virtual environment
python3 -m venv ./venv
source ./venv/bin/activate

# Install libraries
pip install -r requirements.txt

Run

Create a .env file following the format in .env.sample:

cp .env.sample .env

Download NASA logs data:

chmod +x *.sh   # make the scripts executable
./prepare.sh    # download the logs

Elasticsearch and Kibana:

docker-compose -f elasticsearch-kibana-compose.yaml up

Note: the Elasticsearch node runs on port 9200 (ES_PORT in .env, https://localhost:9200) and Kibana runs on port 5601 (KIBANA_PORT in .env, https://localhost:5601).

Simulated data server: run Netcat as a data server

nc -lk 9999

Spark:

./run.sh

Push some logs to the data server and check the result in Kibana.
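Instead of pasting lines into Netcat by hand, a small script can play the same role as the data server. The sketch below is a stand-in for `nc -lk 9999`, not part of the project: it listens on port 9999, waits for one client (here assumed to be the Spark job started by ./run.sh), and sends it log lines one by one. The function names and the `delay` parameter are hypothetical.

```python
import socket
import time
from typing import Iterable


def send_lines(conn: socket.socket, lines: Iterable[str], delay: float = 0.0) -> int:
    """Send each line, newline-terminated, over a connected socket."""
    sent = 0
    for line in lines:
        conn.sendall((line.rstrip("\n") + "\n").encode("utf-8"))
        sent += 1
        if delay:
            time.sleep(delay)  # throttle to imitate a live log stream
    return sent


def serve_lines(lines: Iterable[str], host: str = "localhost",
                port: int = 9999, delay: float = 0.0) -> int:
    """Listen on host:port, wait for one client, and replay the lines to it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        conn, _addr = server.accept()  # blocks until the client connects
        with conn:
            return send_lines(conn, lines, delay)


# Example use (assumes the log file was downloaded by ./prepare.sh and that
# latin-1 is a safe encoding for reading it):
#
#     with open("data/NASA_access_log_Aug95", encoding="latin-1") as handle:
#         serve_lines(handle, delay=0.01)
```

Start the script first, then launch Spark, since `serve_lines` blocks until a client connects. Unlike `nc -lk`, this sketch serves a single connection and then exits.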

Kafka

This section describes our attempt to integrate Kafka (run with Docker) into the project. Some problems remain, so Kafka does not run properly yet.

  • To start Kafka with Docker:
docker-compose -f kafka-compose.yaml up -d
  • To run the simple producer:
python3 simple-server/__init__.py data/NASA_access_log_Aug95
  • To run the main Spark job:
./run_with_kafka.sh

Almost all of these steps run without errors, but we seem to have trouble with the withColumns function, and Elasticsearch appears to be incompatible with Kafka in our setup.
