Link Graph Extractor

Description

This crawler collects only links, restricted to the domains specified in the configuration, and stores the collected links in Neo4j.
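To make the domain restriction concrete, here is a minimal sketch of link extraction with a domain filter. It uses only the Python standard library rather than Scrapy, so it is an illustration of the idea and not this project's spider; the names `LinkCollector`, `in_allowed_domains`, and `extract_edges` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "/about" to absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def in_allowed_domains(url, allowed_domains):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def extract_edges(page_url, html, allowed_domains):
    """Return (source, target) pairs for links inside the allowed domains."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    return [
        (page_url, link)
        for link in collector.links
        if in_allowed_domains(link, allowed_domains)
    ]
```

Each returned pair corresponds to one directed edge of the link graph: the page that was crawled and a page it links to.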

pip3 install Scrapy kafka-python neo4j neobolt neotime PyYAML

First, edit config.yml and set appropriate values for the Kafka, Neo4j, and Scrapy arguments.
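Before starting the crawler, it can help to check that the file has been filled in. The sketch below assumes config.yml has top-level sections named `kafka`, `neo4j`, and `scrapy`; those names, and the helpers `load_config` and `missing_sections`, are hypothetical and may not match the actual file or config_loader.py.

```python
# Hypothetical section names; adjust to match the real config.yml.
REQUIRED_SECTIONS = ("kafka", "neo4j", "scrapy")


def load_config(path="config.yml"):
    """Read the YAML file into a dict (requires PyYAML)."""
    import yaml  # imported here so missing_sections works without PyYAML

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle) or {}


def missing_sections(config, required=REQUIRED_SECTIONS):
    """Return the required top-level sections that are absent or empty."""
    return [name for name in required if not config.get(name)]
```

For example, `missing_sections(load_config())` returns an empty list when every assumed section is present and non-empty.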

Then run the commands below.

python kafka_consumer.py
scrapy crawl graph -s JOBDIR=<crawl-location>
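The consumer is started before the crawl, which suggests that link records pass through Kafka before being written to Neo4j. As a rough sketch of that last step, the code below turns one link record into a Cypher statement with parameters. The JSON message format, the `Page` label, the `LINKS_TO` relationship type, and the function names are all assumptions for illustration, not taken from kafka_consumer.py or neo4j_service.py.

```python
import json

# Assumed graph model: one node per URL, one relationship per link.
# MERGE creates each node or relationship only if it does not exist yet,
# so processing the same link twice does not duplicate it.
MERGE_LINK = (
    "MERGE (src:Page {url: $source}) "
    "MERGE (dst:Page {url: $target}) "
    "MERGE (src)-[:LINKS_TO]->(dst)"
)


def decode_link(raw):
    """Decode an assumed JSON message of the form {"source": ..., "target": ...}."""
    record = json.loads(raw.decode("utf-8"))
    return record["source"], record["target"]


def link_statement(source, target):
    """Return the Cypher query and its parameters for one link."""
    return MERGE_LINK, {"source": source, "target": target}
```

With the neo4j Python driver, the returned pair could be passed to `session.run(query, params)`. Using query parameters rather than string formatting keeps URLs with quotes or other special characters from breaking the statement.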
