parliament-scaper

A public data scraper for the EU Parliament and other parliaments.

##Ruby Based Crawler Setup

1. Install git (if not already present).
2. Clone the project with `git clone https://github.com/sampritipanda/simple_app.git`.
3. Install Ruby (version >= 2.1) and Bundler.
4. Run `bundle install` to install the required gems.
5. Run the script with `ruby eu_scraper.rb` or `./eu_scraper.rb`.
6. Find the scraped questions in the `docs/` folder.

###Technologies Used in Ruby crawler:

- Ruby: the language
- Nokogiri: for HTML parsing

##Scala-based Asynchronous crawler Setup

1. Install sbt, git, and the latest version of Scala (sbt will update Scala for you).
2. Clone the repository with `git clone https://github.com/DengYiping/parliament-scaper.git`.
3. Run `sbt run`. sbt first downloads the necessary dependencies automatically, then runs the script.

###Technologies Used in Scala crawler:

- Scala: a functional programming language on the JVM
- Akka: an effective framework for asynchronous, non-blocking, event-driven programming in Scala
- Spray-client: a lightweight HTTP client based on the Akka actor model

##Python Based Crawler Setup

1. Install the requirements for this crawler with `pip install -r requirements.txt`.
2. Run `python eu-scraper.py`.

###Technologies Used in Python Crawler:

- Requests library for HTTP requests
- lxml library for DOM traversal
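
The two libraries above suggest a fetch-then-traverse loop: download a page, then walk its DOM for the elements of interest. Below is a minimal sketch of the traversal half only. It uses the standard library's `html.parser` as a stand-in for lxml so that it runs without extra installs, and the sample markup, `LinkCollector`, and `extract_links` are hypothetical names for illustration, not taken from the crawler script.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return every (href, text) pair found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup; the real parliament pages will look different.
SAMPLE_HTML = """
<ul>
  <li><a href="/questions/1.pdf">Question 1</a></li>
  <li><a href="/questions/2.pdf">Question 2</a></li>
  <li>No link here</li>
</ul>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

With lxml installed, the same traversal would typically be a short XPath query over the parsed tree; the event-based parser here is only a dependency-free substitute.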

##Python-async parser setup

1. Create a virtual environment inside the `python-async` folder with `virtualenv --python=python3.4 venv`.
2. Activate your virtual environment with `source venv/bin/activate`.
3. Install the requirements with `pip install -r requirements.txt`.
4. Run the parser with `python parser.py`.

###Changing the parser behavior

- Change `YEARS_TO_PARSE` to parse data from different years.
- Change `FOLDER_TO_DOWNLOAD` to change the name of the folder the data is downloaded into.
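
As a rough illustration of how these two settings might interact, the sketch below derives one output directory per year. The value shapes (a list of integers and a folder name string), the per-year layout, and the `output_dirs` helper are all assumptions made for the example; check `parser.py` for the actual definitions.

```python
import os

# Assumed shapes; the real values live in parser.py and may differ.
YEARS_TO_PARSE = [2014, 2015]
FOLDER_TO_DOWNLOAD = "data"


def output_dirs(years=YEARS_TO_PARSE, folder=FOLDER_TO_DOWNLOAD):
    """Return one output directory path per year (hypothetical layout)."""
    return [os.path.join(folder, str(year)) for year in years]


if __name__ == "__main__":
    for path in output_dirs():
        print(path)
```

Under these assumptions, editing either constant changes only where files would land, not how they are fetched.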

###Technologies Used in Python-async parser:

- requests + requests-futures for async requests
- threading for async downloading
- beautifulsoup4 for DOM parsing
- tqdm for the progress bar
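
The thread-based downloading listed above can be sketched with the standard library alone. requests-futures runs requests calls on a `concurrent.futures` executor, so the example uses `ThreadPoolExecutor` directly. `download_all` and `fake_fetch` are hypothetical names, and `fake_fetch` is an offline stand-in for a real HTTP call so that the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=4):
    """Fetch every URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    raised while fetching it, so one failure does not abort the batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        # as_completed yields futures as they finish, not in submit order.
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results


def fake_fetch(url):
    """Offline stand-in for an HTTP GET (assumption: a real run would hit the network)."""
    if url.endswith("missing"):
        raise ValueError("not found: " + url)
    return ("body of " + url).encode("utf-8")


if __name__ == "__main__":
    outcome = download_all(["a.pdf", "b.pdf", "c/missing"], fake_fetch)
    for url in sorted(outcome):
        print(url, type(outcome[url]).__name__)
```

Passing the fetch function in as a parameter keeps the pool logic testable offline; a progress bar such as tqdm could wrap the `as_completed` loop.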
