A standard base spider structure that can be reused with any kind of spider. We are scraping articles using the Python Scrapy framework.
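The sketch below illustrates one way such a base spider can be structured, assuming Scrapy 0.24 on Python 2.7 as in the reference environment below; the item fields, spider names, URLs and XPath expressions are illustrative placeholders, not the project's actual code.

```python
# -*- coding: utf-8 -*-
"""A sketch of a reusable base spider (Scrapy 0.24, Python 2.7).
The item fields, spider names, URLs and XPath expressions are
illustrative placeholders, not the project's actual code."""
import scrapy
from scrapy.selector import Selector


class ArticleItem(scrapy.Item):
    # Placeholder fields for a scraped article
    title = scrapy.Field()
    url = scrapy.Field()


class BaseArticleSpider(scrapy.Spider):
    """Common structure shared by the concrete article spiders.

    Subclasses only declare their own name, start_urls and the
    XPath expressions below; the parsing loop stays in one place.
    """
    article_xpath = None  # XPath selecting one article block
    title_xpath = None    # XPath selecting the title inside a block
    url_xpath = None      # XPath selecting the link inside a block

    def parse(self, response):
        sel = Selector(response)
        for article in sel.xpath(self.article_xpath):
            item = ArticleItem()
            # Assumes each XPath matches at least once inside the block
            item["title"] = article.xpath(self.title_xpath).extract()[0]
            item["url"] = article.xpath(self.url_xpath).extract()[0]
            yield item


class ExampleArticleSpider(BaseArticleSpider):
    """Example concrete spider built on the base structure."""
    name = "exampleArticleSpider"
    start_urls = ["http://example.com/articles"]
    article_xpath = "//div[@class='article']"
    title_xpath = ".//h2/text()"
    url_xpath = ".//a/@href"
```

Concrete spiders then only declare their own name, start URLs and selectors, which keeps the common parsing logic in one place.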
Scraping Tools Setup on Ubuntu
Reference Environment:
• Ubuntu 14.04.1 LTS
• Python 2.7.6
• OpenSSL 1.0.1f
• Twisted 13.2.0
Prerequisites:
• Python 2.7: check by running "python --version"; install if missing.
• OpenSSL: check by running "openssl version"; install if missing.
• Twisted framework: check by running "twistd --version".
Scrapy Installation:
- Run "sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7".
- Run "echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list".
- Run "sudo apt-get update && sudo apt-get install scrapy-0.24". When prompted, enter "Y" to install the extra required packages. Confirm that Scrapy is installed successfully by running "scrapy version".
scrapylib Installation:
- Download the zip from https://github.com/scrapinghub/scrapylib and unzip it to a local folder.
- cd to that folder and run "python setup.py install".
Scrapyd Installation:
- Run "apt-get install scrapyd".
- Once finished, check whether a scrapyd process is running. If it is, kill the process (force kill if required).
- Run "sudo twistd -ny /etc/scrapyd/scrapyd.tac > /var/log/scrapyd/scrapyd.log 2>&1 &". Confirm that the twistd process is running.
- Go to http://SERVER-IP:6800 to confirm that the Scrapyd web console is displayed (a scripted reachability check is sketched below).
Note: if you find any issue, please check the attached file and its permissions.
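Reachability of the web console can also be checked from Python. This is a minimal sketch, assuming Scrapyd listens on the default port 6800 on localhost; replace the host with your SERVER-IP if it runs remotely.

```python
# -*- coding: utf-8 -*-
"""Check that the Scrapyd web console is reachable (Python 2.7)."""
import urllib2

# Assumption: Scrapyd on localhost with the default port; use your SERVER-IP otherwise.
SCRAPYD_URL = "http://localhost:6800/"

try:
    response = urllib2.urlopen(SCRAPYD_URL, timeout=5)
    print "Scrapyd web console is reachable (HTTP %d)" % response.getcode()
except urllib2.URLError as err:
    print "Scrapyd is not reachable: %s" % err
```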
Running the Spiders:
- Step 1: check the running services; if a scrapyd process is already running, kill it.
- Step 2: run "sudo twistd -ny /etc/scrapyd/scrapyd.tac > /var/log/scrapyd/scrapyd.log 2>&1 &", then check the server at http://localhost:6800/.
- Step 3: run the curl commands below in sequence (a polling sketch that automates the sequence follows this list).
  I. cd /article-scraping/scraping
     curl http://localhost:6800/schedule.json -d project=scraping -d spider=basePaging
     Go to http://localhost:6800/ and check whether the job is running. Do not run the second curl until the first job has finished. When the first job has finished, check whether the CSV was created; if it was, run the second curl.
  II. curl http://localhost:6800/schedule.json -d project=scraping -d spider=rookiestewArticleUrlSpider
      When the second job has finished, check whether the CSV was created; if it was, run the third curl.
  III. curl http://localhost:6800/schedule.json -d project=scraping -d spider=rookiestewDetailSpider
- For mail, set your mail ID and password.
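The same sequence can be driven by a small script instead of manual curl calls. This is a minimal sketch, assuming Scrapyd on localhost:6800 and the project name "scraping" with the spider names used above; the CSV file names are placeholders for whatever files the spiders actually write, and the script is meant to be run from /article-scraping/scraping.

```python
# -*- coding: utf-8 -*-
"""Drive the three spiders in sequence through Scrapyd (Python 2.7).
Assumes Scrapyd on localhost:6800 and the project name "scraping" used
in the curl commands above. The CSV file names are placeholders for the
files the spiders actually write; run this from /article-scraping/scraping."""
import json
import os
import time
import urllib
import urllib2

SCRAPYD = "http://localhost:6800"
PROJECT = "scraping"

# (spider name, CSV the step is expected to create) -- CSV names are assumptions
STEPS = [
    ("basePaging", "paging.csv"),
    ("rookiestewArticleUrlSpider", "article_urls.csv"),
    ("rookiestewDetailSpider", "article_details.csv"),
]


def schedule(spider):
    """POST to schedule.json and return the job id."""
    data = urllib.urlencode({"project": PROJECT, "spider": spider})
    reply = json.load(urllib2.urlopen(SCRAPYD + "/schedule.json", data))
    return reply["jobid"]


def wait_until_finished(job_id, poll_seconds=30):
    """Poll listjobs.json until the job shows up in the finished list."""
    while True:
        query = urllib.urlencode({"project": PROJECT})
        jobs = json.load(urllib2.urlopen(SCRAPYD + "/listjobs.json?" + query))
        if any(job["id"] == job_id for job in jobs.get("finished", [])):
            return
        time.sleep(poll_seconds)


for spider, csv_file in STEPS:
    print "Scheduling %s ..." % spider
    job_id = schedule(spider)
    wait_until_finished(job_id)
    # Mirror the manual check: only continue when the expected CSV exists.
    if not os.path.exists(csv_file):
        print "Expected CSV %s was not created; stopping." % csv_file
        break
    print "%s finished; %s exists." % (spider, csv_file)
```

The script only uses Scrapyd's schedule.json and listjobs.json endpoints, i.e. the same API that the curl commands and the web console check rely on.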