This is an academic project which aims to build a data streaming pipeline using Spark Structured Streaming, Elasticsearch and Kibana. The project takes real-world production logs from NASA and processes them with Spark Structured Streaming. The output is stored in Elasticsearch and visualized in a Kibana dashboard.
Resources:
- Data: NASA-HTTP
References:
This project is also inspired by several similar projects:
- Scalable Web Server Log Analytics with Apache Spark
- Web-Server-Log-Analysis-with-PySpark
- Log Analysis 📈 with Spark Streaming 📺, ElasticSearch 📊 and Kibana 👀 :
Requirements:
- Python 3.8
- docker, docker-compose
# Virtual environment
python3 -m venv ./venv
source ./venv/bin/activate
# Install libraries
pip install -r requirements.txt
Create a .env file following the format in .env.sample:
cp .env.sample .env
Download the NASA log data:
chmod +x *.sh  # make the scripts executable
./prepare.sh   # download the logs
Elasticsearch and Kibana:
docker-compose -f elasticsearch-kibana-compose.yaml up
Note: the Elasticsearch node runs on port 9200 (ES_PORT in .env, https://localhost:9200) and Kibana on port 5601 (KIBANA_PORT in .env, https://localhost:5601).
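The exact contents of .env.sample are not reproduced here; based on the two variables mentioned above, the .env file presumably looks something like the following sketch (any other variables in the real .env.sample are not covered):

```
ES_PORT=9200
KIBANA_PORT=5601
```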
Simulated data server: run Netcat as a data server
nc -lk 9999
Spark:
./run.sh
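The heart of the Spark job is parsing NASA Common Log Format lines into structured fields. As an illustration (not the project's actual code), the parsing step can be sketched in plain Python with a regular expression; the field names here are our own choice:

```python
import re

# Common Log Format: host ident user [timestamp] "method endpoint protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)$'
)

def parse_log_line(line):
    """Parse one NASA access-log line into a dict, or return None if malformed."""
    m = LOG_PATTERN.match(line.strip())
    if not m:
        return None
    host, _ident, _user, ts, method, endpoint, protocol, status, size = m.groups()
    return {
        "host": host,
        "timestamp": ts,
        "method": method,
        "endpoint": endpoint,
        "protocol": protocol,
        "status": int(status),
        # the size field is "-" when no body was returned
        "content_size": 0 if size == "-" else int(size),
    }
```

In the Spark job itself, the same regular expression can be applied to the streaming DataFrame (e.g. with regexp_extract) rather than in plain Python.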
Push some logs to the data server and watch the results appear in Kibana.
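Logs can be pasted into the Netcat terminal by hand, but the replay can also be scripted. A minimal sketch (our own helper, not part of the repository) that streams a downloaded log file to the Netcat server line by line:

```python
import socket
import time

def send_logs(path, host="localhost", port=9999, limit=None, delay=0.0):
    """Stream log lines from `path` to a TCP server such as `nc -lk 9999`."""
    with socket.create_connection((host, port)) as sock, \
         open(path, errors="replace") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            sock.sendall(line.encode("utf-8"))
            if delay:
                time.sleep(delay)  # optional throttle to simulate a live stream
```

For example, send_logs("data/NASA_access_log_Aug95", delay=0.01) replays the August logs slowly enough for the streaming query to keep up (assuming prepare.sh placed the file under data/).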
This section describes our attempt to integrate Kafka (run via Docker) into the project. Some problems occurred, so we still cannot run Kafka properly.
- To run Kafka in Docker:
docker-compose -f kafka-compose.yaml up -d
- To run the simple producer:
python3 simple-server/__init__.py data/NASA_access_log_Aug95
- To run the main Spark job:
./run_with_kafka.sh
Almost all of these steps run correctly, but we seem to have trouble with the withColumns function (note that DataFrame.withColumns was only added in PySpark 3.3; older versions must chain withColumn calls instead), and the Elasticsearch sink appears to be incompatible with the Kafka source.