sehsanm changed the title from "Create a Crawler tool to collect information from a set of websites and links to build a corpus" to "Create a Crawler tool to build a corpus" on Dec 3, 2018.
I've created a simple one with scrapy; you need to install scrapy and then run `scrapy crawl quotes -o ham.json` in the project directory. You can also change the website URLs to crawl in `spiders/quotes_spiders.py`. I found scrapy useful because it has features like HTML parsing, configurable request rates, duplicate-URL avoidance, and so on. But there are some requirements I think we still need to solve: 1. we want only Persian words; 2. websites contain many phrases that are not sentences, and we don't want them in our corpus; 3. there are many repetitive phrases across all pages of each website.
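For reference, a minimal sketch of what such a spider could look like, including a filter for requirement 1; the start URL, CSS selectors, and Persian-character regex are illustrative assumptions, not the actual contents of `spiders/quotes_spiders.py`:

```python
import re
import scrapy

# Matches runs of Persian/Arabic-script characters, ZWNJ, and spaces,
# so English words, digits, and markup residue are dropped (requirement 1).
PERSIAN_RUN = re.compile(r"[\u0600-\u06FF\u200c ]+")


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # matches the `scrapy crawl quotes` command above
    start_urls = ["https://example.com/"]  # placeholder; real URLs live in spiders/quotes_spiders.py
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,  # the request-rate control mentioned above
    }

    def parse(self, response):
        # Extract visible text and keep only Persian-script runs.
        for text in response.css("p::text").getall():
            for run in PERSIAN_RUN.findall(text):
                run = run.strip()
                if run:
                    yield {"text": run}

        # Follow links; Scrapy's scheduler deduplicates URLs by default.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Requirements 2 and 3 (dropping non-sentence fragments and repeated boilerplate phrases) would still need a post-processing pass over the collected JSON.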
I've sent you a pull request. I don't think controlling the sentence count while crawling would work well, but we can limit the size of the crawled text in gigabytes; I think 3 GB is enough for 10M sentences.
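Scrapy has no built-in byte-size limit, but the cap suggested above can be enforced from inside the spider by raising `CloseSpider` once a running total is exceeded. A minimal sketch; the spider name, URL, selector, and the `bytes_collected` counter are placeholders, not code from the pull request:

```python
import scrapy
from scrapy.exceptions import CloseSpider

MAX_BYTES = 3 * 1024 ** 3  # ~3 GB of collected text, per the estimate above


class SizeLimitedSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://example.com/"]  # placeholder

    bytes_collected = 0  # running total of UTF-8 text yielded so far

    def parse(self, response):
        for text in response.css("p::text").getall():
            self.bytes_collected += len(text.encode("utf-8"))
            if self.bytes_collected > MAX_BYTES:
                # Stops scheduling new requests and shuts the crawl down cleanly.
                raise CloseSpider("corpus size limit reached")
            yield {"text": text}
```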