Create a Crawler tool to build a corpus #7

Open

sehsanm opened this issue Dec 3, 2018 · 3 comments

sehsanm (Owner) commented Dec 3, 2018

Create a Crawler tool to collect information from a set of websites and links to build a corpus

@sehsanm sehsanm added the CORPUS label Dec 3, 2018
@sehsanm sehsanm added this to the Course Project milestone Dec 3, 2018
@sehsanm sehsanm changed the title from "Create a Crawler tool to collect information from a set of websites and links to build a corpus" to "Create a Crawler tool to build a corpus" Dec 3, 2018

abb4s (Collaborator) commented Dec 3, 2018

I've created a simple one with Scrapy. You need to install Scrapy and run this in the project directory: scrapy crawl quotes -o ham.json. You can change the website URLs to crawl in spiders/quotes_spiders.py. I found Scrapy useful because it has features like HTML parsing, request-rate limiting, and duplicate-URL filtering. But we still have some requirements to solve:

1. We want only Persian words.
2. Many phrases on websites are not sentences, and we don't want them in the corpus.
3. Many phrases are repeated across all pages of each website.

The Scrapy crawler for Hamshahri:
nlpcrawler.zip
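
A minimal sketch of what such a spider could look like (this is not the contents of nlpcrawler.zip; the start URL, CSS selectors, and the Persian-character threshold are assumptions for illustration, and the spider name simply matches the scrapy crawl quotes command above):

```python
# Minimal Scrapy spider sketch: crawl pages, keep only paragraphs that are
# mostly Persian characters, and follow links. Assumes `pip install scrapy`.
import re
import scrapy

PERSIAN_CHAR = re.compile(r"[\u0600-\u06FF]")  # Arabic/Persian Unicode block


def mostly_persian(text, threshold=0.7):
    """Return True if at least `threshold` of non-space characters are Persian."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    persian = sum(1 for c in chars if PERSIAN_CHAR.match(c))
    return persian / len(chars) >= threshold


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # matches the `scrapy crawl quotes -o ham.json` command
    start_urls = ["https://www.hamshahrionline.ir/"]  # hypothetical start page
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,  # request-rate limiting; duplicate URLs are filtered by Scrapy itself
    }

    def parse(self, response):
        # Extract visible paragraph text and keep only Persian-looking lines.
        for paragraph in response.css("p::text").getall():
            text = paragraph.strip()
            if text and mostly_persian(text):
                yield {"text": text}
        # Follow links found on the page; Scrapy deduplicates requests by default.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Running scrapy crawl quotes -o ham.json with a spider shaped like this would write the yielded text items to ham.json.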

sehsanm (Owner, Author) commented Dec 3, 2018

Perfect, this is essentially an implementation of this story. Can you move it into the repository and write some short documentation for it?

Please also note that we need a corpus of roughly 10 million sentences.

@sehsanm sehsanm assigned sehsanm and abb4s and unassigned sehsanm Dec 3, 2018

abb4s (Collaborator) commented Dec 3, 2018

I've sent you a pull request. I don't think it would be practical to control the sentence count while crawling, but we can limit the size of the crawled text in GB. I think 3 GB is enough for 10M sentences.
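
One rough sketch of such a size cap (not the code in the pull request; the spider name, start URL, and selectors are assumptions): track the bytes of extracted text in the spider and raise Scrapy's CloseSpider exception once a budget of about 3 GB is exceeded.

```python
# Sketch: stop crawling once roughly 3 GB of UTF-8 text has been collected.
import scrapy
from scrapy.exceptions import CloseSpider

MAX_BYTES = 3 * 1024 ** 3  # ~3 GB budget, assumed sufficient for ~10M sentences


class SizeLimitedSpider(scrapy.Spider):
    name = "quotes_limited"  # hypothetical spider name
    start_urls = ["https://www.hamshahrionline.ir/"]  # hypothetical start page

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bytes_collected = 0

    def parse(self, response):
        for paragraph in response.css("p::text").getall():
            text = paragraph.strip()
            if not text:
                continue
            self.bytes_collected += len(text.encode("utf-8"))
            if self.bytes_collected > MAX_BYTES:
                # Raising CloseSpider stops the crawl gracefully.
                raise CloseSpider("size limit reached")
            yield {"text": text}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```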
