scrap-full-justdial

A standard base spider structure that can be reused with any kind of spider. We scrape articles using the Python Scrapy framework; the base structure is sketched below.
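As a rough illustration of that base structure (the class and attribute names here are hypothetical, not the project's actual code), a reusable spider skeleton in Scrapy looks like this:

```python
# Illustrative sketch of a reusable base spider (hypothetical names;
# the project's real spiders are basePaging, rookiestewArticleUrlSpider
# and rookiestewDetailSpider).
import scrapy


class BaseArticleSpider(scrapy.Spider):
    """Common structure that concrete article spiders subclass."""
    name = None        # each subclass sets its own spider name
    start_urls = []    # each subclass supplies its start URLs

    def parse(self, response):
        # Subclasses override this with their own extraction logic.
        raise NotImplementedError('subclasses must implement parse()')
```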

Scraping Tools Setup on Ubuntu

Reference Environment:

  • Ubuntu 14.04.1 LTS
  • Python 2.7.6
  • OpenSSL 1.0.1f
  • Twisted 13.2.0

Pre-requisites:

  • Python 2.7: check by running `python --version`. Install if missing.
  • OpenSSL: check by running `openssl version`. Install if missing.
  • Twisted framework: check by running `twistd --version`. Install if missing. (A combined version check is sketched below.)
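The three checks can be run in one go with a small script like the following (a sketch; it only reports versions and does not install anything):

```python
# Report the versions of the three prerequisites in one go.
import subprocess

for label, cmd in [('Python', ['python', '--version']),
                   ('OpenSSL', ['openssl', 'version']),
                   ('Twisted', ['twistd', '--version'])]:
    try:
        # Python 2 prints its version banner to stderr, so merge streams.
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        print label + ':', out.strip()
    except OSError:
        print label + ': not installed'
```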

Scrapy Installation:

  • Run `sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7`.
  • Run `echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list`.
  • Run `sudo apt-get update && sudo apt-get install scrapy-0.24`. When asked, enter "Y" to install the extra required packages.
  • Confirm that Scrapy is installed successfully by running `scrapy version`.

Scrapylib Installation (the main library required for using Crawlera):

  • Download the zip from `https://github.com/scrapinghub/scrapylib` and unzip it to a local folder.
  • cd into that folder and run `python setup.py install`. (A sketch of the Crawlera settings this enables follows below.)
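For reference, scrapylib enables Crawlera through a downloader middleware plus a few settings. The sketch below shows the usual shape of that configuration; the middleware path and setting names are assumptions based on scrapylib's Crawlera module, and the credentials are placeholders:

```python
# settings.py -- sketch of enabling Crawlera via scrapylib
# (credential values are placeholders).
DOWNLOADER_MIDDLEWARES = {
    'scrapylib.crawlera.CrawleraMiddleware': 600,
}
CRAWLERA_ENABLED = True
CRAWLERA_USER = 'your-crawlera-user'      # placeholder
CRAWLERA_PASS = 'your-crawlera-password'  # placeholder
```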
Scrapyd Installation:

  • Run `apt-get install scrapyd`.
  • Once finished, check whether the scrapyd process is running. If it is, kill the process (force kill if required).
  • Run `sudo twistd -ny /etc/scrapyd/scrapyd.tac > /var/log/scrapyd/scrapyd.log 2>&1 &`. Confirm that the twistd process is running.
  • Go to `http://SERVER-IP:6800` to confirm that the Scrapyd web console is displayed (a minimal reachability check is sketched below). Note: if you run into any issue, please check the attached file and its permissions.
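A minimal reachability check for the console (assuming the default port 6800):

```python
# Confirm the Scrapyd web console answers on the default port.
import urllib2  # Python 2, matching the reference environment

try:
    resp = urllib2.urlopen('http://localhost:6800/', timeout=5)
    print 'Scrapyd console is up (HTTP %d)' % resp.getcode()
except Exception as exc:
    print 'Scrapyd console not reachable:', exc
```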

Run curl:

  • Step 1: check running services; if scrapyd is running, kill it.
  • Step 2: run `sudo twistd -ny /etc/scrapyd/scrapyd.tac > /var/log/scrapyd/scrapyd.log 2>&1 &`.
  • Check the server at `http://localhost:6800/`.
  • Step 3: run the curl commands in sequence. First, `cd /article-scraping/scraping`, then:
  • `curl http://localhost:6800/schedule.json -d project=scraping -d spider=basePaging`
  • Go to `http://localhost:6800/` and check whether the job is running. Do not run the second curl until the first job has finished.
  • When the first job has finished, check whether its CSV was created; if so, run the second curl.
  • `curl http://localhost:6800/schedule.json -d project=scraping -d spider=rookiestewArticleUrlSpider` (when the second job has finished, check whether its CSV was created; if so, run the third curl).
  • `curl http://localhost:6800/schedule.json -d project=scraping -d spider=rookiestewDetailSpider` (the whole schedule-wait-check sequence is also scripted below).
  • For mail, set your email id and password (example settings below).
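The schedule-wait-check sequence above can also be scripted against Scrapyd's JSON API (`schedule.json` as used in the curl commands, plus the standard `listjobs.json` endpoint). In this sketch the CSV file names are placeholders; substitute the paths your spiders actually write:

```python
# Sketch: run the three spiders in order, waiting for each Scrapyd job
# to finish and for its CSV to appear before scheduling the next one.
# Assumptions: Scrapyd on localhost:6800, project "scraping"; the CSV
# paths are placeholders for whatever the spiders are configured to write.
import json
import os
import time
import urllib
import urllib2

SCRAPYD = 'http://localhost:6800'
PROJECT = 'scraping'


def schedule(spider):
    """POST to schedule.json, mirroring the curl commands above."""
    data = urllib.urlencode({'project': PROJECT, 'spider': spider})
    reply = json.load(urllib2.urlopen(SCRAPYD + '/schedule.json', data))
    return reply['jobid']


def wait_until_finished(jobid, poll_seconds=30):
    """Poll listjobs.json until the job shows up in the finished list."""
    while True:
        jobs = json.load(urllib2.urlopen(
            SCRAPYD + '/listjobs.json?project=' + PROJECT))
        if any(job['id'] == jobid for job in jobs['finished']):
            return
        time.sleep(poll_seconds)


for spider, csv_path in [('basePaging', 'paging.csv'),
                         ('rookiestewArticleUrlSpider', 'urls.csv'),
                         ('rookiestewDetailSpider', 'details.csv')]:
    jobid = schedule(spider)
    wait_until_finished(jobid)
    if not os.path.exists(csv_path):
        raise RuntimeError('%s finished but %s was not created'
                           % (spider, csv_path))
```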
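The mail step presumably refers to the credentials used by Scrapy's built-in MailSender, which normally live in the project's settings.py. The setting names below are Scrapy's standard mail settings; the host and account values are placeholders to replace with your own:

```python
# settings.py -- mail credentials for Scrapy's MailSender
# (host and account values are placeholders).
MAIL_HOST = 'smtp.example.com'
MAIL_PORT = 587
MAIL_USER = 'your-id@example.com'
MAIL_PASS = 'your-password'
MAIL_FROM = 'your-id@example.com'
MAIL_TLS = True
```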
