Create a Crawler tool to build a corpus #7

Open

sehsanm opened this issue Dec 3, 2018 · 3 comments

sehsanm (Owner) commented Dec 3, 2018

Create a Crawler tool to collect information from a set of websites and links to build a corpus

@sehsanm sehsanm added the CORPUS label Dec 3, 2018
@sehsanm sehsanm added this to the Course Project milestone Dec 3, 2018
@sehsanm sehsanm changed the title from "Create a Crawler tool to collect information from a set of websites and links to build a corpus" to "Create a Crawler tool to build a corpus" Dec 3, 2018

abb4s (Collaborator) commented Dec 3, 2018

I've created a simple one with Scrapy. You need to install Scrapy and run this in the project directory: scrapy crawl quotes -o ham.json. You can change the website URLs to crawl in spiders/quotes_spiders.py. I found Scrapy useful because it has features like HTML parsing, request-rate limiting, and duplicate-URL filtering. But we still have some requirements to solve:

1. We want only Persian words.
2. Many phrases on websites are not sentences, and we don't want them in the corpus.
3. Many phrases are repeated across all pages of each website.

The Scrapy crawler for Hamshahri:
nlpcrawler.zip
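
A minimal sketch of what such a spider could look like (this is not the contents of nlpcrawler.zip; the start URL, CSS selectors, and the Persian-character threshold are assumptions for illustration, and the spider name simply matches the scrapy crawl quotes command above):

```python
# Minimal Scrapy spider sketch: crawl pages, keep only paragraphs that are
# mostly Persian characters, and follow links. Assumes `pip install scrapy`.
import re
import scrapy

PERSIAN_CHAR = re.compile(r"[\u0600-\u06FF]")  # Arabic/Persian Unicode block


def mostly_persian(text, threshold=0.7):
    """Return True if at least `threshold` of non-space characters are Persian."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    persian = sum(1 for c in chars if PERSIAN_CHAR.match(c))
    return persian / len(chars) >= threshold


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # matches the `scrapy crawl quotes -o ham.json` command
    start_urls = ["https://www.hamshahrionline.ir/"]  # hypothetical start page
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,  # request-rate limiting; duplicate URLs are filtered by Scrapy itself
    }

    def parse(self, response):
        # Extract visible paragraph text and keep only Persian-looking lines.
        for paragraph in response.css("p::text").getall():
            text = paragraph.strip()
            if text and mostly_persian(text):
                yield {"text": text}
        # Follow links found on the page; Scrapy deduplicates requests by default.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Running scrapy crawl quotes -o ham.json with a spider shaped like this would write the yielded text items to ham.json.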

sehsanm (Owner, Author) commented Dec 3, 2018

Perfect, this is essentially an implementation of this story. Can you move it into the repository and write some short documentation for it?

Please also note that we need a corpus of roughly 10 million sentences.

@sehsanm sehsanm assigned sehsanm and abb4s and unassigned sehsanm Dec 3, 2018

abb4s (Collaborator) commented Dec 3, 2018

I've sent you a pull request. I don't think it would be practical to control the sentence count while crawling, but we can limit the size of the crawled text in GB. I think 3 GB is enough for 10M sentences.
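
One rough sketch of such a size cap (not the code in the pull request; the spider name, start URL, and selectors are assumptions): track the bytes of extracted text in the spider and raise Scrapy's CloseSpider exception once a budget of about 3 GB is exceeded.

```python
# Sketch: stop crawling once roughly 3 GB of UTF-8 text has been collected.
import scrapy
from scrapy.exceptions import CloseSpider

MAX_BYTES = 3 * 1024 ** 3  # ~3 GB budget, assumed sufficient for ~10M sentences


class SizeLimitedSpider(scrapy.Spider):
    name = "quotes_limited"  # hypothetical spider name
    start_urls = ["https://www.hamshahrionline.ir/"]  # hypothetical start page

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bytes_collected = 0

    def parse(self, response):
        for paragraph in response.css("p::text").getall():
            text = paragraph.strip()
            if not text:
                continue
            self.bytes_collected += len(text.encode("utf-8"))
            if self.bytes_collected > MAX_BYTES:
                # Raising CloseSpider stops the crawl gracefully.
                raise CloseSpider("size limit reached")
            yield {"text": text}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```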
