Scrapes public administration publication data and stores it in an Elasticsearch instance. Currently supports Diario Oficial de Galicia (DOGA) publications.
Create a Virtual Environment
```
python -m venv papenv  # On Mac/Linux use python3
```
Activate your Virtual Environment
```
papenv\Scripts\activate     # On Windows
source papenv/bin/activate  # On Mac/Linux
```
Install project dependencies
```
pip install -r requirements.txt
```
Get a list of initial pages to configure the crawler. You can use this script to generate pages for the current year.
```
python define_start_urls.py  # On Mac/Linux use python3
```
It will store a set of URLs in `data/start_urls.json` that point to the current year's DOGA documents.
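For reference, here is a minimal sketch of what such a generator could look like. The URL template below is a made-up placeholder, not the pattern the real script uses; check `define_start_urls.py` for the actual one.

```python
import json
from datetime import date, timedelta

# Hypothetical URL pattern, for illustration only.
URL_TEMPLATE = "https://www.xunta.gal/diario-oficial-galicia/{d}"

def build_start_urls(year: int) -> list[str]:
    """One candidate URL per day, from January 1st of `year` up to today."""
    urls, day, today = [], date(year, 1, 1), date.today()
    while day <= today:
        urls.append(URL_TEMPLATE.format(d=day.isoformat()))
        day += timedelta(days=1)
    return urls

if __name__ == "__main__":
    with open("data/start_urls.json", "w") as f:
        json.dump(build_start_urls(date.today().year), f, indent=2)
```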
To execute the crawler, run the following command:
```
scrapy crawl doga_spider
```
It will crawl the seed URLs from `data/DOGA_start_urls.json`. After its execution, you will find the file `data/TMP_output.json` containing a dictionary of scraped elements. You'll have to manually rename this file to `data/DOGA_output.json`.
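If you'd rather not rename by hand, a small helper can check the output and move it in one step (file names taken from the steps above):

```python
import json
import os

SRC = "data/TMP_output.json"
DST = "data/DOGA_output.json"

with open(SRC, encoding="utf-8") as f:
    documents = json.load(f)  # the dictionary of scraped elements

print(f"Crawl produced {len(documents)} elements")
os.replace(SRC, DST)  # overwrites DST if it already exists
```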
There are two options to deploy a development Elasticsearch setup:
- Execute an Elasticsearch container:
```
docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.10.0
```
- Run a local Elasticsearch instance. Since version 8, Elasticsearch uses HTTPS by default; this can be disabled by editing the configuration file `config/elasticsearch.yml` and adding the following directives at the bottom:
```
xpack.security.enabled: false
xpack.security.transport.ssl.enabled: false
xpack.security.http.ssl.enabled: false
```
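Whichever option you choose, here is a quick sanity check that the node is reachable before indexing anything. It assumes the default port 9200 used above and that the `requests` package is available:

```python
import requests

# The root endpoint returns cluster and version info when the node is up.
resp = requests.get("http://localhost:9200")
resp.raise_for_status()
print("Elasticsearch", resp.json()["version"]["number"], "is up")
```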
To store the scraped documents in Elasticsearch, run the command:
```
python bulk_post_documents.py  # On Mac/Linux use python3
```
There's also a client to consume the stored data; check the PAP Search Client repository for instructions on how to execute it.
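As a rough idea of what the bulk upload involves, here is a sketch using the official `elasticsearch` Python client. The index name `doga` and the document layout are assumptions, not necessarily what `bulk_post_documents.py` actually does:

```python
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

with open("data/DOGA_output.json", encoding="utf-8") as f:
    documents = json.load(f)

# Wrap each scraped element as a bulk "index" action.
actions = (
    {"_index": "doga", "_id": doc_id, "_source": doc}
    for doc_id, doc in documents.items()
)
indexed, _ = helpers.bulk(es, actions)
print(f"Indexed {indexed} documents")
```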
To scaffold a spider for a new source (for example, BOE), use Scrapy's generator:
```
scrapy genspider boe_spider boe.es
```
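The generated file is just a skeleton; filled in, a minimal spider might look like the sketch below. The parse logic is a placeholder to be modeled on `doga_spider`:

```python
import scrapy

class BoeSpiderSpider(scrapy.Spider):
    name = "boe_spider"
    allowed_domains = ["boe.es"]
    start_urls = ["https://boe.es/"]

    def parse(self, response):
        # Placeholder extraction; the real spider should yield the same
        # fields that doga_spider produces.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```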