Public Data Scraper for Parliamentary Data from the EU and Other Parliaments
- Install git (if not present already)
- Clone the project using `git clone https://github.com/sampritipanda/simple_app.git`
- Install Ruby (version >= 2.1) and Bundler
- Run `bundle install` to install the required gems
- Run the script using `ruby eu_scraper.rb` or `./eu_scraper.rb`
- Find the scraped questions in the docs/ folder
###Technologies Used in Ruby scraper:
- Ruby: the language
- Nokogiri: for HTML parsing
##Scala-based Asynchronous crawler Setup
- Install sbt, git and the latest version of Scala (sbt will handle the Scala update for you)
- Clone the project using `git clone https://github.com/DengYiping/parliament-scaper.git`
- Run `sbt run`
- sbt first downloads the necessary dependencies automatically, then runs the crawler.
###Technologies Used in Scala crawler:
- Scala: a functional programming language on the JVM
- Akka: an effective framework for asynchronous, non-blocking and event-driven programming in Scala
- Spray-client: a lightweight HTTP client based on the Akka actor model
##Python Based Crawler Setup
- Install the requirements for this crawler with `pip install -r requirements.txt`
- Run `python eu_scraper.py`
###Technologies Used in Python Crawler:
- Requests library
- lxml library for DOM traversal
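
As a rough illustration of the DOM-traversal step, the sketch below parses an HTML string with lxml and collects every link together with its text. The function name `extract_links` and the idea of collecting all `<a>` elements are assumptions made for this example; the selectors used by `eu_scraper.py` itself are not documented here and will differ.

```python
from urllib.parse import urljoin

from lxml import html


def extract_links(page_source, base_url):
    """Return (link text, absolute URL) pairs for every <a href> in the page.

    Hypothetical helper: the real crawler may select different elements.
    """
    tree = html.fromstring(page_source)
    links = []
    for anchor in tree.xpath("//a[@href]"):
        text = anchor.text_content().strip()
        # Resolve relative hrefs against the page they were found on.
        links.append((text, urljoin(base_url, anchor.get("href"))))
    return links
```

In a crawler, `page_source` would typically be the `.text` of a response fetched with Requests, and `base_url` the URL that was requested.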
##Python-async parser setup
- Create a virtual environment inside the `python-async` folder with `virtualenv --python=python3.4 venv`
- Activate your virtual environment with `source venv/bin/activate`
- Install all appropriate requirements with `pip install -r requirements.txt`
- Run the parser with `python parser.py`
###Changing the parser behavior:
- Change `YEARS_TO_PARSE` to parse data from different years
- Change `FOLDER_TO_DOWNLOAD` to change the name of the folder the data is downloaded into
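
To make the effect of these two settings concrete, here is a minimal sketch of how they might drive the output layout. It assumes `YEARS_TO_PARSE` is a list of integers and `FOLDER_TO_DOWNLOAD` is a relative folder name; the values, the per-year subfolder layout and the helper `planned_folders` are illustrative assumptions, so check `parser.py` for the actual types and usage.

```python
import os

# Assumed shapes and values, for illustration only.
YEARS_TO_PARSE = [2014, 2015]
FOLDER_TO_DOWNLOAD = "downloads"


def planned_folders(years, root):
    """Return one output folder per year under the download root (hypothetical layout)."""
    return [os.path.join(root, str(year)) for year in years]


if __name__ == "__main__":
    for folder in planned_folders(YEARS_TO_PARSE, FOLDER_TO_DOWNLOAD):
        print(folder)
```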
###Technologies Used in Python-async parser:
- Requests + requests-futures for async requests
- threading for async downloading
- beautifulsoup4 for DOM parsing
- tqdm for progress bar
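
The concurrent-download pattern behind this stack can be sketched with the standard library alone. The example below uses `concurrent.futures.ThreadPoolExecutor` directly, which is the same executor that requests-futures builds on; it is a stand-in, not the parser's actual code. The function `download_all` and its `fetch` parameter are hypothetical names, and `fetch` is passed in so the sketch can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    raised while fetching it, so one failure does not abort the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        # as_completed yields futures in the order they finish.
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

In practice `fetch` could be a small function that calls `requests.get(url)` and returns the response body, and the `as_completed` loop is a natural place to advance a tqdm progress bar.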