Add Spiders for Farsi Blogs (#26)
* add farsi blogs

* modify crawler README file

* modify #2 crawler README file

* add strip function to hamshahri_spider
Sahand504 authored and sehsanm committed Dec 18, 2018
1 parent fcc6d28 commit 5775398
Showing 8 changed files with 115 additions and 5 deletions.
16 changes: 12 additions & 4 deletions scripts/crawler/crawler/README.md
@@ -1,10 +1,18 @@

A simple crawler using the Scrapy framework. Scrapy offers useful features such as duplicate-URL filtering, depth limits, and configurable request rates. <br />
You need to install Scrapy: <br />
`pip install scrapy` <br />
and run the crawlers: <br />

`scrapy crawl hamshahri -o hamshahri.json` <br />
`scrapy crawl blog -o blog.json` <br />
`scrapy crawl blogfa -o blogfa.json` <br />
`scrapy crawl blogsky -o blogsky.json` <br />
`scrapy crawl dorsablog -o dorsablog.json` <br />
`scrapy crawl mihanblog -o mihanblog.json` <br />
`scrapy crawl persianblog -o persianblog.json` <br />

Settings can be found in `crawler/settings.py`. <br />
Spiders are available in `crawler/spiders`.
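The README points at `crawler/settings.py` for the features it lists (duplicate-URL filtering, depth limits, request rate). A minimal sketch of what that file could contain — the setting names below are standard Scrapy options, but the values are illustrative, not this project's actual configuration:

```python
# Hypothetical excerpt of a Scrapy settings.py -- the names are real Scrapy
# settings; the values are illustrative, not this project's actual config.
DEPTH_LIMIT = 3        # "limitation on depth": stop following links after 3 hops
DOWNLOAD_DELAY = 0.5   # "request rate": wait 0.5 s between requests to a domain
ROBOTSTXT_OBEY = True  # be polite to the crawled blogs
# "avoiding duplicate urls" is handled by Scrapy's default dupefilter:
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"
```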


17 changes: 17 additions & 0 deletions scripts/crawler/crawler/spiders/blog_spider.py
@@ -0,0 +1,17 @@
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "blog"
    start_urls = [
        'http://blog.ir/topblogs/96'
    ]
    allowed_domains = ["blog.ir"]

    def parse(self, response):
        # yield each paragraph's text, stripped of surrounding whitespace
        for quote in response.css('p::text').extract():
            yield {
                'text': quote.strip()
            }
        # follow every on-page link; allowed_domains keeps the crawl on-site
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)
17 changes: 17 additions & 0 deletions scripts/crawler/crawler/spiders/blogfa_spider.py
@@ -0,0 +1,17 @@
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "blogfa"
    start_urls = [
        'https://blogfa.com/members/'
    ]
    allowed_domains = ["blogfa.com"]

    def parse(self, response):
        # yield each paragraph's text, stripped of surrounding whitespace
        for quote in response.css('p::text').extract():
            yield {
                'text': quote.strip()
            }
        # follow every on-page link; allowed_domains keeps the crawl on-site
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)
17 changes: 17 additions & 0 deletions scripts/crawler/crawler/spiders/blogsky_spider.py
@@ -0,0 +1,17 @@
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "blogsky"
    start_urls = [
        'http://www.blogsky.com/posts'
    ]
    allowed_domains = ["blogsky.com"]

    def parse(self, response):
        # yield each paragraph's text, stripped of surrounding whitespace
        for quote in response.css('p::text').extract():
            yield {
                'text': quote.strip()
            }
        # follow every on-page link; allowed_domains keeps the crawl on-site
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)
17 changes: 17 additions & 0 deletions scripts/crawler/crawler/spiders/dorsablog_spider.py
@@ -0,0 +1,17 @@
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "dorsablog"
    start_urls = [
        'https://dorsablog.com/update'
    ]
    allowed_domains = ["dorsablog.com"]

    def parse(self, response):
        # yield each paragraph's text, stripped of surrounding whitespace
        for quote in response.css('p::text').extract():
            yield {
                'text': quote.strip()
            }
        # follow every on-page link; allowed_domains keeps the crawl on-site
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)
2 changes: 1 addition & 1 deletion scripts/crawler/crawler/spiders/hamshahri_spider.py
@@ -10,7 +10,7 @@ class QuotesSpider(scrapy.Spider):
    def parse(self, response):
        for quote in response.css('p::text').extract():
            yield {
-               'text': quote
+               'text': quote.strip()
            }
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)
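The hunk above is the "add strip function" change from the commit message: `p::text` extraction keeps the leading and trailing whitespace that surrounds text nodes in the HTML source, and `.strip()` normalizes it before the item is written out. A minimal illustration with made-up sample strings:

```python
# Sample strings shaped like what response.css('p::text').extract()
# returns: text nodes with whitespace and newlines from the markup.
raw = ["\n    Hello, world.\n  ", "  sample paragraph "]

# The commit's fix: strip each extracted string before yielding it.
cleaned = [text.strip() for text in raw]
print(cleaned)  # ['Hello, world.', 'sample paragraph']
```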
17 changes: 17 additions & 0 deletions scripts/crawler/crawler/spiders/mihanblog_spider.py
@@ -0,0 +1,17 @@
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "mihanblog"
    start_urls = [
        'http://mihanblog.com/'
    ]
    allowed_domains = ["mihanblog.com"]

    def parse(self, response):
        # yield each paragraph's text, stripped of surrounding whitespace
        for quote in response.css('p::text').extract():
            yield {
                'text': quote.strip()
            }
        # follow every on-page link; allowed_domains keeps the crawl on-site
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)
17 changes: 17 additions & 0 deletions scripts/crawler/crawler/spiders/persianblog_spider.py
@@ -0,0 +1,17 @@
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "persianblog"
    start_urls = [
        'https://persianblog.ir/'
    ]
    allowed_domains = ["persianblog.ir"]

    def parse(self, response):
        # yield each paragraph's text, stripped of surrounding whitespace
        for quote in response.css('p::text').extract():
            yield {
                'text': quote.strip()
            }
        # follow every on-page link; allowed_domains keeps the crawl on-site
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse)
