Scrapes public administration publication data and stores it in an Elasticsearch instance. Currently supports Diario Oficial de Galicia (DOGA) publications.
Create a Virtual Environment
```
python -m venv papenv  # On Mac/Linux use python3
```
Activate your Virtual Environment
```
papenv\Scripts\activate     # On Windows
source papenv/bin/activate  # On Mac/Linux
```
Install project dependencies
```
pip install -r requirements.txt
```
Get a list of initial pages to configure the crawler. You can use this script to generate pages for the current year.
```
python define_start_urls.py  # On Mac/Linux use python3
```
It will store a set of URLs in `data/start_urls.json` that point to the current year's DOGA documents.
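For reference, here is a minimal sketch of what such a generator could look like. The URL template below is a made-up placeholder, not the pattern the real script uses; check `define_start_urls.py` for the actual one.

```python
import json
from datetime import date, timedelta

# Hypothetical URL pattern, for illustration only.
URL_TEMPLATE = "https://www.xunta.gal/diario-oficial-galicia/{d}"

def build_start_urls(year: int) -> list[str]:
    """One candidate URL per day, from January 1st of `year` up to today."""
    urls, day, today = [], date(year, 1, 1), date.today()
    while day <= today:
        urls.append(URL_TEMPLATE.format(d=day.isoformat()))
        day += timedelta(days=1)
    return urls

if __name__ == "__main__":
    with open("data/start_urls.json", "w") as f:
        json.dump(build_start_urls(date.today().year), f, indent=2)
```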
To execute the crawler, run the following command:
```
scrapy crawl doga_spider
```
It will crawl the seed URLs from `data/DOGA_start_urls.json`. After its execution, you will find the file `data/TMP_output.json` containing a dictionary of scraped elements. You'll have to manually rename this file to `data/DOGA_output.json`.
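If you'd rather not rename by hand, a small helper can check the output and move it in one step (file names taken from the steps above):

```python
import json
import os

SRC = "data/TMP_output.json"
DST = "data/DOGA_output.json"

with open(SRC, encoding="utf-8") as f:
    documents = json.load(f)  # the dictionary of scraped elements

print(f"Crawl produced {len(documents)} elements")
os.replace(SRC, DST)  # overwrites DST if it already exists
```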
There are two options to deploy a development Elasticsearch setup:
- Execute an Elasticsearch container:
```
docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.10.0
```
- Run a local Elasticsearch instance. Since version 8, Elasticsearch uses HTTPS by default; this can be disabled by editing the configuration file `config/elasticsearch.yml` and adding the following directives at the bottom:
```
xpack.security.enabled: false
xpack.security.transport.ssl.enabled: false
xpack.security.http.ssl.enabled: false
```
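Whichever option you choose, here is a quick sanity check that the node is reachable before indexing anything. It assumes the default port 9200 used above and that the `requests` package is available:

```python
import requests

# The root endpoint returns cluster and version info when the node is up.
resp = requests.get("http://localhost:9200")
resp.raise_for_status()
print("Elasticsearch", resp.json()["version"]["number"], "is up")
```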
To store the scraped documents in Elasticsearch, run the command:
```
python bulk_post_documents.py  # On Mac/Linux use python3
```
There's also a client to consume the stored data; check the PAP Search Client repository for instructions on how to execute it.
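As a rough idea of what the bulk upload involves, here is a sketch using the official `elasticsearch` Python client. The index name `doga` and the document layout are assumptions, not necessarily what `bulk_post_documents.py` actually does:

```python
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

with open("data/DOGA_output.json", encoding="utf-8") as f:
    documents = json.load(f)

# Wrap each scraped element as a bulk "index" action.
actions = (
    {"_index": "doga", "_id": doc_id, "_source": doc}
    for doc_id, doc in documents.items()
)
indexed, _ = helpers.bulk(es, actions)
print(f"Indexed {indexed} documents")
```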
To scaffold a spider for a new source (for example, BOE), use Scrapy's generator:
```
scrapy genspider boe_spider boe.es
```
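The generated file is just a skeleton; filled in, a minimal spider might look like the sketch below. The parse logic is a placeholder to be modeled on `doga_spider`:

```python
import scrapy

class BoeSpiderSpider(scrapy.Spider):
    name = "boe_spider"
    allowed_domains = ["boe.es"]
    start_urls = ["https://boe.es/"]

    def parse(self, response):
        # Placeholder extraction; the real spider should yield the same
        # fields that doga_spider produces.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```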